GNU bug report logs -
#43862
[PATCH] grep: set RE_NO_SUB for calling regex only to check syntax
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Thu, 8 Oct 2020 09:41:01 UTC
Severity: normal
Tags: patch
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 43862 in the body.
You can then email your comments to 43862 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#43862
; Package
grep
.
(Thu, 08 Oct 2020 09:41:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Thu, 08 Oct 2020 09:41:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
We can set RE_NO_SUB for calling regex only to check syntax. It brings
performance gains in cases to have a lot of enormous epsilon nodes.
$ printf '(%020000d)\n' | sed 's/0/|/g' >pat
(before)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 6.15
user 4.62
sys 1.52
(after)
$ time -p env LC_ALL=C src/grep -Ef pat /dev/null
real 0.66
user 0.19
sys 0.46
[0001-grep-set-RE_NO_SUB-for-calling-regex-only-to-check-s.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#43862
; Package
grep
.
(Mon, 12 Oct 2020 23:09:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 43862 <at> debbugs.gnu.org (full text, mbox):
On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> We can set RE_NO_SUB for calling regex only to check syntax. It brings
> performance gains in cases to have a lot of enormous epsilon nodes.
>
>
> $ printf '(%020000d)\n' | sed 's/0/|/g' >pat
>
> (before)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 6.15
> user 4.62
> sys 1.52
>
> (after)
> $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> real 0.66
> user 0.19
> sys 0.46
Thank you.
FYI, when running similar commands with and without your patch (with
an eye to adding a test), I ran this one (with your patch). It shows
that using 80,000 terms caused grep to consume 32GB of memory before
being OOM-killed:
$ printf '(%080000d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
Command terminated by signal 9
6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata
32024460maxresident)k
6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
[Exit 137 (KILL)]
I will come back to this later this week.
Reply sent
to
Jim Meyering <jim <at> meyering.net>
:
You have taken responsibility.
(Sun, 01 Nov 2020 19:41:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Sun, 01 Nov 2020 19:41:02 GMT)
Full text and
rfc822 format available.
Message #13 received at 43862-done <at> debbugs.gnu.org (full text, mbox):
On Mon, Oct 12, 2020 at 4:08 PM Jim Meyering <jim <at> meyering.net> wrote:
> On Thu, Oct 8, 2020 at 2:41 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> >
> > We can set RE_NO_SUB for calling regex only to check syntax. It brings
> > performance gains in cases to have a lot of enormous epsilon nodes.
> >
> >
> > $ printf '(%020000d)\n' | sed 's/0/|/g' >pat
> >
> > (before)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 6.15
> > user 4.62
> > sys 1.52
> >
> > (after)
> > $ time -p env LC_ALL=C src/grep -Ef pat /dev/null
> > real 0.66
> > user 0.19
> > sys 0.46
>
> Thank you.
>
> FYI, when running similar commands with and without your patch (with
> an eye to adding a test), I ran this one (with your patch). It shows
> that using 80,000 terms caused grep to consume 32GB of memory before
> being OOM-killed:
>
> $ printf '(%080000d)\n' | sed 's/0/|/g' | env time src/grep -Ef- /dev/null
> Command terminated by signal 9
> 6.42user 19.98system 0:57.91elapsed 45%CPU (0avgtext+0avgdata
> 32024460maxresident)k
> 6504inputs+0outputs (92major+12003644minor)pagefaults 0swaps
> [Exit 137 (KILL)]
>
> I will come back to this later this week.
We must accept the fact that extreme regular expressions will cause
resource exhaustion like that when processed by classical regex_*
functions. This is yet another good reason to prefer PCRE and to use
grep's -P option. In that case, it fails like this:
$ printf '(%080000d)\n' | sed 's/0/|/g' |grep -Pf- /dev/null
grep: regular expression is too large
I have just pushed your patch, but without adding a test.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#43862
; Package
grep
.
(Mon, 02 Nov 2020 04:22:01 GMT)
Full text and
rfc822 format available.
Message #16 received at 43862-done <at> debbugs.gnu.org (full text, mbox):
On Sun, 1 Nov 2020 11:39:55 -0800
Jim Meyering <jim <at> meyering.net> wrote:
> We must accept the fact that extreme regular expressions will cause
> resource exhaustion like that when processed by classical regex_*
> functions. This is yet another good reason to prefer PCRE and to use
> grep's -P option. In that case, it fails like this:
>
> $ printf '(%080000d)\n' | sed 's/0/|/g' |grep -Pf- /dev/null
> grep: regular expression is too large
>
> I have just pushed your patch, but without adding a test.
I also investigated the slowdown, and I reached the same view as you.
The regex consumes a lot of memory for patterns with normous epsilon
closures.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Mon, 30 Nov 2020 12:24:07 GMT)
Full text and
rfc822 format available.
This bug report was last modified 3 years and 147 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.