GNU bug report logs - #40634
Massive pattern list handling with -E format seems very slow since 2.28.

Previous Next

Package: grep;

Reported by: fryasu <at> yahoo.co.jp

Date: Wed, 15 Apr 2020 02:21:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 40634 in the body.
You can then email your comments to 40634 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Wed, 15 Apr 2020 02:21:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to fryasu <at> yahoo.co.jp:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 15 Apr 2020 02:21:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: fryasu <at> yahoo.co.jp
To: "bug-grep <at> gnu.org" <bug-grep <at> gnu.org>
Subject: Massive pattern list handling with -E format seems very slow since
 2.28.
Date: Wed, 15 Apr 2020 09:26:55 +0900 (JST)
Hi,

Massive pattern list handling with -E format seems very
slow, since grep 2.28.

Conversion from -E format to -F format may have problem
about performance.

When the processing time is measured by the script below,
the result isobviously different between 2.28 and 2.27.

----
#!/bin/bash
export LC_ALL=C

rm -f grep-patterns.txt
for i in {1..2000}; do
     pat=$(echo -n "$i" | sha1sum | cut -f1 -d ' ')
     echo -e "$pat$pat(\$|$pat)" >> grep-patterns.txt
done

echo executing grep...
time grep -E -v -m1 -f grep-patterns.txt /dev/null
----

The following is the results in my PC with fedora's RPM.
https://koji.fedoraproject.org/koji/packageinfo?packageID=1023

- result with grep 2.28

  real 0m11.087s / user 0m11.027s / sys 0m0.037s

- result with grep 2.27

  real 0m0.144s / user 0m0.116s / sys 0m0.027s

With also recent 3.4, result is same.


I hope you find it useful.


regards,





Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Thu, 16 Apr 2020 06:58:01 GMT) Full text and rfc822 format available.

Message #8 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 40634 <at> debbugs.gnu.org
Cc: fryasu <at> yahoo.co.jp
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Thu, 16 Apr 2020 15:56:58 +0900
+ grep-2.2/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
grep-2.2/src/grep: invalid option -- 'm'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
real 0.00
user 0.00
sys 0.00
+ grep-2.3/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
grep-2.3/src/grep: invalid option -- 'm'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
real 0.00
user 0.00
sys 0.00
+ grep-2.4/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
grep-2.4/src/grep: invalid option -- 'm'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
real 0.00
user 0.00
sys 0.00
+ grep-2.4.1/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
grep-2.4.1/src/grep: invalid option -- 'm'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
real 0.00
user 0.00
sys 0.00
+ grep-2.4.2/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
grep-2.4.2/src/grep: invalid option -- 'm'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
real 0.00
user 0.00
sys 0.00
+ grep-2.5.4/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 13.58
user 13.43
sys 0.14
+ grep-2.6/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.95
user 3.51
sys 0.42
+ grep-2.6.1/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.97
user 3.86
sys 0.11
+ grep-2.6.2/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.98
user 3.69
sys 0.28
+ grep-2.6.3/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.99
user 3.94
sys 0.04
+ grep-2.7/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.93
user 3.68
sys 0.24
+ grep-2.8/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.89
user 3.83
sys 0.05
+ grep-2.9/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.92
user 3.63
sys 0.27
+ grep-2.10/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.93
user 3.65
sys 0.27
+ grep-2.11/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.98
user 3.89
sys 0.08
+ grep-2.12/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.95
user 3.87
sys 0.06
+ grep-2.13/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.97
user 3.85
sys 0.11
+ grep-2.14/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 4.01
user 3.91
sys 0.09
+ grep-2.15/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 4.05
user 3.99
sys 0.05
+ grep-2.16/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.91
user 3.50
sys 0.40
+ grep-2.17/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.98
user 3.94
sys 0.03
+ grep-2.18/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 3.98
user 3.87
sys 0.10
+ grep-2.19/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.54
user 0.50
sys 0.03
+ grep-2.20/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.62
user 0.57
sys 0.04
+ grep-2.21/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.43
user 0.41
sys 0.02
+ grep-2.22/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.18
user 0.16
sys 0.01
+ grep-2.23/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.18
user 0.13
sys 0.04
+ grep-2.24/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.18
user 0.13
sys 0.04
+ grep-2.25/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.18
user 0.16
sys 0.01
+ grep-2.26/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.18
user 0.14
sys 0.04
+ grep-2.27/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.17
user 0.15
sys 0.02
+ grep-2.28/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.22
user 7.14
sys 0.07
+ grep-3.0/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.26
user 7.11
sys 0.14
+ grep-3.1/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.22
user 6.88
sys 0.33
+ grep-3.2/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.20
user 7.04
sys 0.15
+ grep-3.3/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.24
user 7.17
sys 0.07
+ grep-3.4/src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.13
user 6.49
sys 0.63

It seems to a lot of time is spent in dfa.c:replace().
It was added at d6df3873c7abc243683d0e8fccbfde4e76f23e53 in gnulib.





Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Thu, 16 Apr 2020 16:32:03 GMT) Full text and rfc822 format available.

Message #11 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 40634 <at> debbugs.gnu.org
Cc: fryasu <at> yahoo.co.jp
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Thu, 16 Apr 2020 09:31:32 -0700
On 4/15/20 11:56 PM, Norihiro Tanaka wrote:

> It seems to a lot of time is spent in dfa.c:replace().
> It was added at d6df3873c7abc243683d0e8fccbfde4e76f23e53 in gnulib.

It would be pretty drastic to revert that patch. Is there some better way to 
move forward?




Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Thu, 16 Apr 2020 22:54:02 GMT) Full text and rfc822 format available.

Message #14 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: fryasu <at> yahoo.co.jp, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 17 Apr 2020 07:53:12 +0900
On Thu, 16 Apr 2020 09:31:32 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 4/15/20 11:56 PM, Norihiro Tanaka wrote:
> 
> > It seems to a lot of time is spent in dfa.c:replace().
> > It was added at d6df3873c7abc243683d0e8fccbfde4e76f23e53 in gnulib.
> 
> It would be pretty drastic to revert that patch. Is there some better way to move forward?

I have had no idea to solve the problem yet.  If we revert it, bug#33357
will come back.





Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Thu, 16 Apr 2020 23:01:01 GMT) Full text and rfc822 format available.

Message #17 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: fryasu <at> yahoo.co.jp, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Thu, 16 Apr 2020 16:00:29 -0700
On 4/16/20 3:53 PM, Norihiro Tanaka wrote:

> I have had no idea to solve the problem yet.  If we revert it, bug#33357
> will come back.

Yes, I'd rather not revert if we can help it.

My own thought was to not analyze the regular expression if we discover that the 
input is empty. :-)




Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 17 Apr 2020 00:36:01 GMT) Full text and rfc822 format available.

Message #20 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: fryasu <at> yahoo.co.jp, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 17 Apr 2020 09:35:36 +0900
On Thu, 16 Apr 2020 16:00:29 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> 
> > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > will come back.
> 
> Yes, I'd rather not revert if we can help it.
> 
> My own thought was to not analyze the regular expression if we discover that the input is empty. :-)

Now, I have a idea, it is that we build indexes of epsilon nodes
including in follows before remove epsilon nodes.





Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 17 Apr 2020 01:25:01 GMT) Full text and rfc822 format available.

Message #23 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <40634 <at> debbugs.gnu.org>
Cc: fryasu <at> yahoo.co.jp, Paul Eggert <eggert <at> cs.ucla.edu>, bug-gnulib <at> gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 17 Apr 2020 10:24:42 +0900
[Message part 1 (text/plain, inline)]
On Fri, 17 Apr 2020 09:35:36 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> 
> On Thu, 16 Apr 2020 16:00:29 -0700
> Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> 
> > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > 
> > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > will come back.
> > 
> > Yes, I'd rather not revert if we can help it.
> > 
> > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> 
> Now, I have a idea, it is that we build indexes of epsilon nodes
> including in follows before remove epsilon nodes.


I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
[0001-dfa-build-auxiliary-indexes-before-remove-epsilon-cl.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 17 Apr 2020 15:23:02 GMT) Full text and rfc822 format available.

Message #26 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 40634 <at> debbugs.gnu.org
Cc: fryasu <at> yahoo.co.jp, Paul Eggert <eggert <at> cs.ucla.edu>, bug-gnulib <at> gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Sat, 18 Apr 2020 00:22:26 +0900
[Message part 1 (text/plain, inline)]
On Fri, 17 Apr 2020 10:24:42 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> 
> On Fri, 17 Apr 2020 09:35:36 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> 
> > 
> > On Thu, 16 Apr 2020 16:00:29 -0700
> > Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > 
> > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > 
> > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > will come back.
> > > 
> > > Yes, I'd rather not revert if we can help it.
> > > 
> > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > 
> > Now, I have a idea, it is that we build indexes of epsilon nodes
> > including in follows before remove epsilon nodes.
> 
> 
> I wrote fix for the bug, but it will be slower then at grep 2.27 yet.

It was improved previous patch.
[0001-dfa-build-auxiliary-indexes-before-remove-epsilon-cl_old.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Sat, 18 Apr 2020 22:42:01 GMT) Full text and rfc822 format available.

Message #29 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 40634 <at> debbugs.gnu.org,
 fryasu <at> yahoo.co.jp
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, bug-gnulib <at> gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Sun, 19 Apr 2020 07:41:49 +0900
[Message part 1 (text/plain, inline)]
On Sat, 18 Apr 2020 00:22:26 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> 
> On Fri, 17 Apr 2020 10:24:42 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> 
> > 
> > On Fri, 17 Apr 2020 09:35:36 +0900
> > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > 
> > > 
> > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > 
> > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > 
> > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > will come back.
> > > > 
> > > > Yes, I'd rather not revert if we can help it.
> > > > 
> > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > 
> > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > including in follows before remove epsilon nodes.
> > 
> > 
> > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> 
> It was improved previous patch.

Sorry, correct patch is here.
[0001-dfa-build-auxiliary-indexes-before-remove-epsilon-cl.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Sun, 19 Apr 2020 02:11:01 GMT) Full text and rfc822 format available.

Message #32 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 40634 <at> debbugs.gnu.org
Cc: fryasu <at> yahoo.co.jp, Paul Eggert <eggert <at> cs.ucla.edu>, bug-gnulib <at> gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Sun, 19 Apr 2020 11:10:26 +0900
[Message part 1 (text/plain, inline)]
On Sun, 19 Apr 2020 07:41:49 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> 
> On Sat, 18 Apr 2020 00:22:26 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> 
> > 
> > On Fri, 17 Apr 2020 10:24:42 +0900
> > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > 
> > > 
> > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > 
> > > > 
> > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > > 
> > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > 
> > > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > > will come back.
> > > > > 
> > > > > Yes, I'd rather not revert if we can help it.
> > > > > 
> > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > > 
> > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > including in follows before remove epsilon nodes.
> > > 
> > > 
> > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > 
> > It was improved previous patch.
> 
> Sorry, correct patch is here.

I made the previous patch even simpler.

before:

$ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 7.24
user 7.14
sys 0.09

after:

$ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
real 0.62
user 0.52
sys 0.10
[0001-dfa-use-backword-set-in-removal-of-epsilon-closure.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Mon, 20 Apr 2020 14:55:02 GMT) Full text and rfc822 format available.

Message #35 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: fryasu <at> yahoo.co.jp
To: "40634 <at> debbugs.gnu.org" <40634 <at> debbugs.gnu.org>
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Mon, 20 Apr 2020 17:44:14 +0900 (JST)
Hi, 

I immediately tried the attached patch, and was able
to confirmthat grep perfomance is improved.


Our work to find thousands of patterns in a file with
tens of thousand lines is going extremely smoothly.


I'm glad for your quick response.

Thank you.





Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 11 Sep 2020 12:49:01 GMT) Full text and rfc822 format available.

Message #38 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 GNU grep developers <grep-devel <at> gnu.org>
Cc: fryasu <at> yahoo.co.jp, "bug-gnulib <at> gnu.org List" <bug-gnulib <at> gnu.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 11 Sep 2020 14:47:49 +0200
On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> On Sun, 19 Apr 2020 07:41:49 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > On Sat, 18 Apr 2020 00:22:26 +0900
> > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> >
> > >
> > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > > >
> > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > >
> > > > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > > > will come back.
> > > > > >
> > > > > > Yes, I'd rather not revert if we can help it.
> > > > > >
> > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > > >
> > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > including in follows before remove epsilon nodes.
> > > >
> > > >
> > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > >
> > > It was improved previous patch.
> >
> > Sorry, correct patch is here.
>
> I made the previous patch even simpler.
>
> before:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 7.24
> user 7.14
> sys 0.09
>
> after:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 0.62
> user 0.52
> sys 0.10

Thank you for this patch. I have rebased and made minor syntactic changes.
I'll push it to gnulib soon, if not today, then by Monday.

I am considering creating a test case in grep, but it feels too tight
to be feasible: I would use a relative perf test, requiring that a
passing test incur a perf cost of less than say 100x. Here's the
beginnings of my attempt (note: this is just an outline -- obviously
would not rely on having "time" in path or as a shell builtin):

gen()
{
  local n=$1
  local i=1
  while : ; do
    local pat=$(printf $i | sha1sum | cut -d' ' -f1)
    printf '%s\n' "$pat$pat(\$|$pat)"
    i=$(expr $i + 1)
    test $i = $n && break
  done
}

gen 4000 > pats-4000
head -400 pats-4000 > pats-400

# With fixed code, that a 10x input size increase (n=400 to 4000)
# induces a 40x runtime increase: .05 -> 2.0s
# Just prior to this change, it's 150x: 0.2 -> 30s

env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null




Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 11 Sep 2020 13:18:01 GMT) Full text and rfc822 format available.

Message #41 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 GNU grep developers <grep-devel <at> gnu.org>
Cc: fryasu <at> yahoo.co.jp, "bug-gnulib <at> gnu.org List" <bug-gnulib <at> gnu.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 11 Sep 2020 15:17:29 +0200
[Message part 1 (text/plain, inline)]
On Fri, Sep 11, 2020 at 2:47 PM Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > On Sun, 19 Apr 2020 07:41:49 +0900
> > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > On Sat, 18 Apr 2020 00:22:26 +0900
> > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > > Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > > Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > > > > >
> > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > > >
> > > > > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > > > > will come back.
> > > > > > >
> > > > > > > Yes, I'd rather not revert if we can help it.
> > > > > > >
> > > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > > > >
> > > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > > including in follows before remove epsilon nodes.
> > > > >
> > > > >
> > > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > > >
> > > > It was improved previous patch.
> > >
> > > Sorry, correct patch is here.
> >
> > I made the previous patch even simpler.
> >
> > before:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 7.24
> > user 7.14
> > sys 0.09
> >
> > after:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 0.62
> > user 0.52
> > sys 0.10
>
> Thank you for this patch. I have rebased and made minor syntactic changes.
> I'll push it to gnulib soon, if not today, then by Monday.
>
> I am considering creating a test case in grep, but it feels too tight
> to be feasible: I would use a relative perf test, requiring that a
> passing test incur a perf cost of less than say 100x. Here's the
> beginnings of my attempt (note: this is just an outline -- obviously
> would not rely on having "time" in path or as a shell builtin):
>
> gen()
> {
>   local n=$1
>   local i=1
>   while : ; do
>     local pat=$(printf $i | sha1sum | cut -d' ' -f1)
>     printf '%s\n' "$pat$pat(\$|$pat)"
>     i=$(expr $i + 1)
>     test $i = $n && break
>   done
> }
>
> gen 4000 > pats-4000
> head -400 pats-4000 > pats-400
>
> # With fixed code, that a 10x input size increase (n=400 to 4000)
> # induces a 40x runtime increase: .05 -> 2.0s
> # Just prior to this change, it's 150x: 0.2 -> 30s
>
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null

And here is the adjusted patch:
[dfa.c-epsilon-node-removal-speedup.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Fri, 11 Sep 2020 23:02:01 GMT) Full text and rfc822 format available.

Message #44 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 GNU grep developers <grep-devel <at> gnu.org>
Cc: fryasu <at> yahoo.co.jp, Gnulib bugs <bug-gnulib <at> gnu.org>, 40634 <at> debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Fri, 11 Sep 2020 16:01:01 -0700
[Message part 1 (text/plain, inline)]
> And here is the adjusted patch:

Hold on, that looks like a cleanup of the April 18 patch posted here:

https://bugs.gnu.org/40634#26

But there's a later patch dated April 19, which Norihiro Tanaka said should be 
more correct and simpler:

https://bugs.gnu.org/40634#32

I'll try to take a look at the later patch.
[0001-dfa-minor-improvements-of-previous-change.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Sat, 12 Sep 2020 06:43:01 GMT) Full text and rfc822 format available.

Message #47 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: fryasu <at> yahoo.co.jp, Gnulib bugs <bug-gnulib <at> gnu.org>, 40634 <at> debbugs.gnu.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, GNU grep developers <grep-devel <at> gnu.org>
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Sat, 12 Sep 2020 08:41:51 +0200
On Sat, Sep 12, 2020 at 1:01 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > And here is the adjusted patch:
>
> Hold on, that looks like a cleanup of the April 18 patch posted here:
>
> https://bugs.gnu.org/40634#26
>
> But there's a later patch dated April 19, which Norihiro Tanaka said should be
> more correct and simpler:
>
> https://bugs.gnu.org/40634#32
>
> I'll try to take a look at the later patch.

Oh! Glad you spotted that.




Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Mon, 14 Sep 2020 02:04:01 GMT) Full text and rfc822 format available.

Message #50 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: fryasu <at> yahoo.co.jp, Gnulib bugs <bug-gnulib <at> gnu.org>, 40634 <at> debbugs.gnu.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, GNU grep developers <grep-devel <at> gnu.org>
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Sun, 13 Sep 2020 19:03:33 -0700
[Message part 1 (text/plain, inline)]
On 9/11/20 11:41 PM, Jim Meyering wrote:

>> https://bugs.gnu.org/40634#32
>>
>> I'll try to take a look at the later patch.
> 
> Oh! Glad you spotted that.

I took a look and the basic idea sounds good though I admit I did not check 
every detail. While looking into it I found some opportunities for improvements, 
plus I found what appear to be some longstanding bugs in the area, one of which 
causes a grep test failure on Solaris (and I suspect the bug is also on 
GNU/Linux but the grep tests don't catch it). I installed the attached patches 
into Gnulib, updated grep to point to the new Gnulib version, and added a note 
in grep's NEWS file about this.

Patch 1 is what Norihiro Tanaka proposed in Bug#40634#32, except I edited the 
commit message. Patch 2 consists of minor cleanups and performance tweaks for 
Patch 1. (Patches 3 and 4 are omitted as they were installed by others into 
Gnulib at about the same time I was installing these.) Patch 5 fixes a 
dfa-heap-overrun failure on Solaris that appears to be a longstanding bug 
exposed by Patch 1 when running on Solaris. Patch 6 merely cleans up code near 
Patch 5. Patch 7 fixes the use of an uninitialized constraint, which I 
discovered while debugging Patch 5 under Valgrind; this also appears to be a 
longstandiung bug.

Coming up with test cases for all these bugs would be pretty tricky, unfortunately.
[0001-dfa-use-backward-set-in-removal-of-epsilon-closure.patch (text/x-patch, attachment)]
[0002-dfa-epsilon-closure-tweaks-Bug-40634.patch (text/x-patch, attachment)]
[0005-dfa-fix-dfa-heap-overrun-failure.patch (text/x-patch, attachment)]
[0006-dfa-assume-C99-in-reorder_tokens.patch (text/x-patch, attachment)]
[0007-dfa-avoid-use-of-uninitialized-constraint.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#40634; Package grep. (Mon, 14 Sep 2020 14:15:01 GMT) Full text and rfc822 format available.

Message #53 received at 40634 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: fryasu <at> yahoo.co.jp, Gnulib bugs <bug-gnulib <at> gnu.org>, 40634 <at> debbugs.gnu.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, GNU grep developers <grep-devel <at> gnu.org>
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Mon, 14 Sep 2020 07:14:14 -0700
On Sun, Sep 13, 2020 at 7:03 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 9/11/20 11:41 PM, Jim Meyering wrote:
> >> https://bugs.gnu.org/40634#32
> >>
> >> I'll try to take a look at the later patch.
> >
> > Oh! Glad you spotted that.
>
> I took a look and the basic idea sounds good though I admit I did not check
> every detail. While looking into it I found some opportunities for improvements,
> plus I found what appear to be some longstanding bugs in the area, one of which
> causes a grep test failure on Solaris (and I suspect the bug is also on
> GNU/Linux but the grep tests don't catch it). I installed the attached patches
> into Gnulib, updated grep to point to the new Gnulib version, and added a note
> in grep's NEWS file about this.
>
> Patch 1 is what Norihiro Tanaka proposed in Bug#40634#32, except I edited the
> commit message. Patch 2 consists of minor cleanups and performance tweaks for
> Patch 1. (Patches 3 and 4 are omitted as they were installed by others into
> Gnulib at about the same time I was installing these.) Patch 5 fixes a
> dfa-heap-overrun failure on Solaris that appears to be a longstanding bug
> exposed by Patch 1 when running on Solaris. Patch 6 merely cleans up code near
> Patch 5. Patch 7 fixes the use of an uninitialized constraint, which I
> discovered while debugging Patch 5 under Valgrind; this also appears to be a
> longstandiung bug.
>
> Coming up with test cases for all these bugs would be pretty tricky, unfortunately.

Wow! Thank you!




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 21 Sep 2020 19:23:01 GMT) Full text and rfc822 format available.

Notification sent to fryasu <at> yahoo.co.jp:
bug acknowledged by developer. (Mon, 21 Sep 2020 19:23:01 GMT) Full text and rfc822 format available.

Message #58 received at 40634-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: fryasu <at> yahoo.co.jp, Gnulib bugs <bug-gnulib <at> gnu.org>,
 40634-done <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>,
 GNU grep developers <grep-devel <at> gnu.org>
Subject: Re: bug#40634: Massive pattern list handling with -E format seems
 very slow since 2.28.
Date: Mon, 21 Sep 2020 12:22:50 -0700
The dust seems to have settled on this, so I'm closing the grep bug report to 
tidy things up.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 20 Oct 2020 11:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 182 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.