GNU bug report logs - #43862
[PATCH] grep: set RE_NO_SUB for calling regex only to check syntax

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Thu, 8 Oct 2020 09:41:01 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 43862 in the body.
You can then email your comments to 43862 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#43862; Package grep. (Thu, 08 Oct 2020 09:41:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 08 Oct 2020 09:41:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH] grep: set RE_NO_SUB for calling regex only to check syntax
Date: Thu, 08 Oct 2020 18:40:36 +0900
[Message part 1 (text/plain, inline)]
We can set RE_NO_SUB for calling regex only to check syntax.  It brings
performance gains in cases to have a lot of enormous epsilon nodes.


$ printf '(%020000d)\n' | sed 's/0/|/g' >pat

(before)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 6.15
user 4.62
sys 1.52

(after)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 0.66
user 0.19
sys 0.46
[0001-grep-set-RE_NO_SUB-for-calling-regex-only-to-check-s.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#43862; Package grep. (Mon, 12 Oct 2020 23:09:02 GMT) Full text and rfc822 format available.

Message #8 received at 43862 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43862 <at> debbugs.gnu.org
Subject: Re: bug#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to
 check syntax
Date: Mon, 12 Oct 2020 16:08:28 -0700
On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> We can set RE_NO_SUB for calling regex only to check syntax.  It brings
> performance gains in cases to have a lot of enormous epsilon nodes.
>
>
> $ printf '(%020000d)\n' | sed 's/0/|/g' >pat
>
> (before)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 6.15
> user 4.62
> sys 1.52
>
> (after)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 0.66
> user 0.19
> sys 0.46

Thank you.

FYI, when running similar commands with and without your patch (with
an eye to adding a test), I ran this one (with your patch). It shows
that using 80,000 terms caused grep to consume 32GB of memory before
being OOM-killed:

$ printf '(%080000d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
Command terminated by signal 9
6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata
32024460maxresident)k
6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
[Exit 137 (KILL)]

I will come back to this later this week.




Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 01 Nov 2020 19:41:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sun, 01 Nov 2020 19:41:02 GMT) Full text and rfc822 format available.

Message #13 received at 43862-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43862-done <at> debbugs.gnu.org
Subject: Re: bug#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to
 check syntax
Date: Sun, 1 Nov 2020 11:39:55 -0800
On Mon, Oct 12, 2020 at 4:08 PM Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> >
> > We can set RE_NO_SUB for calling regex only to check syntax.  It brings
> > performance gains in cases to have a lot of enormous epsilon nodes.
> >
> >
> > $ printf '(%020000d)\n' | sed 's/0/|/g' >pat
> >
> > (before)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 6.15
> > user 4.62
> > sys 1.52
> >
> > (after)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 0.66
> > user 0.19
> > sys 0.46
>
> Thank you.
>
> FYI, when running similar commands with and without your patch (with
> an eye to adding a test), I ran this one (with your patch). It shows
> that using 80,000 terms caused grep to consume 32GB of memory before
> being OOM-killed:
>
> $ printf '(%080000d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
> Command terminated by signal 9
> 6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata
> 32024460maxresident)k
> 6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
> [Exit 137 (KILL)]
>
> I will come back to this later this week.

We must accept the fact that extreme regular expressions will cause
resource exhaustion like that when processed by classical regex_*
functions. This is yet another good reason to prefer PCRE and to use
grep's -P option. In that case, it fails like this:

$ printf '(%080000d)\n' | sed 's/0/|/g' |grep -Pf- /dev/null
grep: regular expression is too large

I have just pushed your patch, but without adding a test.




Information forwarded to bug-grep <at> gnu.org:
bug#43862; Package grep. (Mon, 02 Nov 2020 04:22:01 GMT) Full text and rfc822 format available.

Message #16 received at 43862-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 43862-done <at> debbugs.gnu.org
Subject: Re: bug#43862: [PATCH] grep: set RE_NO_SUB for calling regex only to
 check syntax
Date: Mon, 02 Nov 2020 13:21:42 +0900
On Sun, 1 Nov 2020 11:39:55 -0800
Jim Meyering <jim <at> meyering.net> wrote:

> We must accept the fact that extreme regular expressions will cause
> resource exhaustion like that when processed by classical regex_*
> functions. This is yet another good reason to prefer PCRE and to use
> grep's -P option. In that case, it fails like this:
> 
> $ printf '(%080000d)\n' | sed 's/0/|/g' |grep -Pf- /dev/null
> grep: regular expression is too large
> 
> I have just pushed your patch, but without adding a test.

I also investigated the slowdown, and I reached the same view as you.

The regex consumes a lot of memory for patterns with normous epsilon
closures.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 30 Nov 2020 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 147 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.