GNU bug report logs - #33249
[PATCH] grep: grouping of patterns including back reference

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 3 Nov 2018 11:43:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 33249 in the body.
You can then email your comments to 33249 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Sat, 03 Nov 2018 11:43:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 03 Nov 2018 11:43:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: [PATCH] grep: grouping of patterns including back reference
Date: Sat, 03 Nov 2018 20:41:46 +0900
[Message part 1 (text/plain, inline)]
Hi,

When grep uses regex, it splits a pattern with multiple lines by
newline character.  Compilation and executution run for each fragment. 
That causes slowdown.  By this change, each fragment is divided into
groups by whether the fragment includes back reference in a pattern or
not. a frgment which includes back reference constitutes group, and all
frgments which include back reference also constitute a group.

This change extremely speeds-up following case.

- before

    $ yes 00000000000000000000000000000000000000000x | head -10 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 1.94
    user 1.70
    sys 0.24

    $ yes 00000000000000000000000000000000000000000x | head -100 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 13.04
    user 12.78
    sys 0.25

- after

    $ yes 00000000000000000000000000000000000000000x | head -10 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 0.75
    user 0.56
    sys 0.17

    $ yes 00000000000000000000000000000000000000000x | head -100 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    real 0.74
    user 0.57
    sys 0.16

Thanks,
Norihiro
[0001-grep-grouping-of-patterns-including-back-reference.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Sat, 03 Nov 2018 15:30:02 GMT) Full text and rfc822 format available.

Message #8 received at 33249 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 33249 <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sat, 3 Nov 2018 08:29:39 -0700
Norihiro Tanaka wrote:
> By this change, each fragment is divided into
> groups by whether the fragment includes back reference in a pattern or
> not. a frgment which includes back reference constitutes group, and all
> frgments which include back reference also constitute a group.

Surely this is not sufficient. An invocation of grep like this:

grep -E '(a
b)'

should be an error, but with the proposed patch won't it be equivalent to "grep 
-E '(a|b)'" since the pattern has no back-references?




Information forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Sat, 03 Nov 2018 23:22:01 GMT) Full text and rfc822 format available.

Message #11 received at 33249 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 33249 <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sun, 04 Nov 2018 08:21:18 +0900
On Sat, 3 Nov 2018 08:29:39 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > By this change, each fragment is divided into
> > groups by whether the fragment includes back reference in a pattern or
> > not. a frgment which includes back reference constitutes group, and all
> > frgments which include back reference also constitute a group.
> 
> Surely this is not sufficient. An invocation of grep like this:
> 
> grep -E '(a
> b)'
> 
> should be an error, but with the proposed patch won't it be equivalent to "grep -E '(a|b)'" since the pattern has no back-references?

Even the pattern has no back-references, compilation by regex run for
each line.  So Syntax errors will be detected as even your present.

  212:      if (!regex_compile (dc, p, len, dc->pcount, lineno, !backref))
  213:        compilation_failed = true;

  $ env LC_ALL=C src/grep -E '(a
  b)' ~/in
  src/grep: Unmatched ( or \(





Information forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Sun, 04 Nov 2018 04:03:01 GMT) Full text and rfc822 format available.

Message #14 received at 33249 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 33249 <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sat, 3 Nov 2018 21:02:19 -0700
Norihiro Tanaka wrote:
> Even the pattern has no back-references, compilation by regex run for
> each line.  So Syntax errors will be detected as even your present.

OK, but then I'm afraid I don't understand the motivation for the patch. Perhaps 
if you gave a self-contained example? The examples in the original bug report 
are incomplete because they're missing the pattern file; when I tried the first 
one I observed this:

$ yes 00000000000000000000000000000000000000000x | head -10 >in
$ time -p env LC_ALL=C grep -f pat in
grep: pat: No such file or directory
real 0.00
user 0.00
sys 0.00




Information forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Sun, 04 Nov 2018 04:27:01 GMT) Full text and rfc822 format available.

Message #17 received at 33249 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 33249 <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sun, 04 Nov 2018 13:25:55 +0900
On Sat, 3 Nov 2018 21:02:19 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Norihiro Tanaka wrote:
> > Even the pattern has no back-references, compilation by regex run for
> > each line.  So Syntax errors will be detected as even your present.
> 
> OK, but then I'm afraid I don't understand the motivation for the patch. Perhaps if you gave a self-contained example? The examples in the original bug report are incomplete because they're missing the pattern file; when I tried the first one I observed this:
> 
> $ yes 00000000000000000000000000000000000000000x | head -10 >in
> $ time -p env LC_ALL=C grep -f pat in
> grep: pat: No such file or directory
> real 0.00
> user 0.00
> sys 0.00

Sorry, It's missing.  The pattern for all cases is as following.

  $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat

Norihiro





Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 23 Dec 2019 00:58:01 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 23 Dec 2019 00:58:01 GMT) Full text and rfc822 format available.

Message #22 received at 33249-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 33249-done <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Sun, 22 Dec 2019 16:57:12 -0800
[Message part 1 (text/plain, inline)]
On 11/3/18 9:25 PM, Norihiro Tanaka wrote:

>   $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat

Thanks for the test case and sorry about the delay. And thanks for spotting the
speedup opportunity. I found a few problems with the proposed patch, though:

> +        if (keys[1] == '\\')
> +          keys++;

This mishandles the case where the input byte sequence contains '\', '\', '1'
where the first '\' is the last byte of a multibyte character. Such a byte
sequence can contain backreferences but the function will say it doesn't.

> +      if (backref && prev < p)
> +        {
> +          len = p - prev;
> +          buf = xrealloc (buf, (buflen + len) * sizeof *buf);
> +          memcpy (buf + buflen, p, len);
> +          buflen += len;
> +        }

This seems to have three problems. First, the memcpy copies from P, but it
should copy from PREV. Second, this code assigns to LEN, which breaks a later
use of LEN. Third, if there are many patterns with backreferences, repeated use
of realloc will have O(N**2) behavior.

> +      for (size_t i = 0; i < dc->pcount; i++)
> +        dc->patterns[i + 1] = dc->patterns[i];

This copies dc->patterns[0] to the later values in that array, when a memmove
was intended.

So, after installing the patch, I immediately installed the attached patch,
which should address the abovementioned issues.

Thanks again. You did the hard work - I merely proofread it.
[0001-grep-fix-some-bugs-in-pattern-grouping-speedup.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#33249; Package grep. (Mon, 23 Dec 2019 15:10:01 GMT) Full text and rfc822 format available.

Message #25 received at 33249-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 33249-done <at> debbugs.gnu.org
Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back
 reference
Date: Tue, 24 Dec 2019 00:09:03 +0900
On Sun, 22 Dec 2019 16:57:12 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 11/3/18 9:25 PM, Norihiro Tanaka wrote:
> 
> >   $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat
> 
> Thanks for the test case and sorry about the delay. And thanks for spotting the
> speedup opportunity. I found a few problems with the proposed patch, though:
> 
> > +        if (keys[1] == '\\')
> > +          keys++;
> 
> This mishandles the case where the input byte sequence contains '\', '\', '1'
> where the first '\' is the last byte of a multibyte character. Such a byte
> sequence can contain backreferences but the function will say it doesn't.
> 
> > +      if (backref && prev < p)
> > +        {
> > +          len = p - prev;
> > +          buf = xrealloc (buf, (buflen + len) * sizeof *buf);
> > +          memcpy (buf + buflen, p, len);
> > +          buflen += len;
> > +        }
> 
> This seems to have three problems. First, the memcpy copies from P, but it
> should copy from PREV. Second, this code assigns to LEN, which breaks a later
> use of LEN. Third, if there are many patterns with backreferences, repeated use
> of realloc will have O(N**2) behavior.
> 
> > +      for (size_t i = 0; i < dc->pcount; i++)
> > +        dc->patterns[i + 1] = dc->patterns[i];
> 
> This copies dc->patterns[0] to the later values in that array, when a memmove
> was intended.
> 
> So, after installing the patch, I immediately installed the attached patch,
> which should address the abovementioned issues.
> 
> Thanks again. You did the hard work - I merely proofread it.

Sorry and thanks for the review and fixing.  possible_backrefs_in_pattern
is greatful!  I can't find any solution better than it.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 21 Jan 2020 12:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 68 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.