GNU bug report logs - #22655
grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine)

Package: grep;

Reported by: Sergei Trofimovich <slyfox <at> gentoo.org>

Date: Sat, 13 Feb 2016 23:24:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22655 in the body.
You can then email your comments to 22655 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 13 Feb 2016 23:24:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Sergei Trofimovich <slyfox <at> gentoo.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 13 Feb 2016 23:24:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Sergei Trofimovich <slyfox <at> gentoo.org>
To: bug-grep <at> gnu.org
Cc: Jim Meyering <meyering <at> fb.com>, skvadrik <at> gmail.com,
 Paul Eggert <eggert <at> cs.ucla.edu>, ulm <at> gentoo.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: grep-2.21 (and git master): --null-data and ranges work in an odd
 way (-P works fine)
Date: Sat, 13 Feb 2016 23:20:52 +0000

[Message part 1 (text/plain, inline)]

The issue is found by Ulrich Mueller:

It seems DFA engine does not understand --null-data:

~/dev/git/grep $ cat a-test.sh 
#!/bin/bash

printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z    '^[1234yz]*$'  | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234yz]*$'  | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z    '^[1234y-z]*$' | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234y-z]*$' | wc -c

~/dev/git/grep $ ./a-test.sh 
0
6
6
6

All 4 should return 6 but first is not correct.
It seems that 'y-z' range disables dfa.c code and works fine.

-- 

  Sergei

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 14 Feb 2016 00:10:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ulrich Mueller <ulm <at> gentoo.org>
To: Sergei Trofimovich <slyfox <at> gentoo.org>
Cc: Jim Meyering <meyering <at> fb.com>, skvadrik <at> gmail.com, bug-grep <at> gnu.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd
 way (-P works fine)
Date: Sun, 14 Feb 2016 01:08:43 +0100

>>>>> On Sat, 13 Feb 2016, Sergei Trofimovich wrote:

> The issue is found by Ulrich Mueller:
> It seems DFA engine does not understand --null-data:

> ~/dev/git/grep $ cat a-test.sh 
> #!/bin/bash

> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z    '^[1234yz]*$'  | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234yz]*$'  | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z    '^[1234y-z]*$' | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234y-z]*$' | wc -c

> ~/dev/git/grep $ ./a-test.sh 
> 0
> 6
> 6
> 6

> All 4 should return 6 but first is not correct.
> It seems that 'y-z' range disables dfa.c code and works fine.

Hm, I think it is the other way around. \n is a normal char, so the
regex shouldn't match (and all four should return 0). In any case,
behaviour should be the same for all of them.

Downstream bug: https://bugs.gentoo.org/show_bug.cgi?id=574662

Ulrich

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 14 Feb 2016 21:59:02 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ulya Fokanova <skvadrik <at> gmail.com>
To: Ulrich Mueller <ulm <at> gentoo.org>, Sergei Trofimovich <slyfox <at> gentoo.org>
Cc: Jim Meyering <meyering <at> fb.com>, bug-grep <at> gnu.org,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd
 way (-P works fine)
Date: Sun, 14 Feb 2016 20:02:13 +0000

[Message part 1 (text/plain, inline)]

I've explored the following case:

   $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
   6

It's a bug (there should be no match).

This is what grep does:

 * triesto build DFA (as indfa.c)
 * fails to expand character range [1-4] because of multibyte
   localeen_US.utf-8 and gives up building DFA(marks [1-4] as BACKREF
   that suppressesall dfa.c-related code), note the difference with
   [1234] casein whichthere's no need to expand multibyte range
 * falls back to Regex (gnulib extension of regex.h)
 * Regex doesn't support '-z'semantics(the closest configuration to
   '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP
   set), so '\n'is treated as newline and match erroneously succeeds

I think this should be worked around in grep: before calling 're_search' 
it should split the input string by 'eolbyte'.

The bug also present with PCRE engine:

   $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
   6
   $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
   6

Ulya

[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 04:20:02 GMT) Full text and rfc822 format available.

Message #14 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Ulya Fokanova <skvadrik <at> gmail.com>
Cc: Jim Meyering <meyering <at> fb.com>, Ulrich Mueller <ulm <at> gentoo.org>,
 22655 <at> debbugs.gnu.org, Sergei Trofimovich <slyfox <at> gentoo.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sat, 20 Feb 2016 20:19:20 -0800

[Message part 1 (text/plain, inline)]

On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova <skvadrik <at> gmail.com> wrote:
> I've explored the following case:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>    6
>
> It's a bug (there should be no match).
>
> This is what grep does:
>
>  * triesto build DFA (as indfa.c)
>  * fails to expand character range [1-4] because of multibyte
>    localeen_US.utf-8 and gives up building DFA(marks [1-4] as BACKREF
>    that suppressesall dfa.c-related code), note the difference with
>    [1234] casein whichthere's no need to expand multibyte range
>  * falls back to Regex (gnulib extension of regex.h)
>  * Regex doesn't support '-z'semantics(the closest configuration to
>    '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP
>    set), so '\n'is treated as newline and match erroneously succeeds
>
> I think this should be worked around in grep: before calling 're_search' it
> should split the input string by 'eolbyte'.
>
> The bug also present with PCRE engine:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>    6
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>    6

Thank you for the analysis and the report.
I have fixed the regex-oriented problem with the attached
patch, but not yet the case using -P -z (PCRE + --null-data):

[0001-grep-z-avoid-erroneous-match-with-regexp-anchor-and-.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 04:24:01 GMT) Full text and rfc822 format available.

Message #17 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Ulya Fokanova <skvadrik <at> gmail.com>
Cc: Jim Meyering <meyering <at> fb.com>, Ulrich Mueller <ulm <at> gentoo.org>,
 22655 <at> debbugs.gnu.org, Sergei Trofimovich <slyfox <at> gentoo.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sat, 20 Feb 2016 20:23:05 -0800

[Message part 1 (text/plain, inline)]

On Sat, Feb 20, 2016 at 8:19 PM, Jim Meyering <jim <at> meyering.net> wrote:
...
>> The bug also present with PCRE engine:
>>
>>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>>    6
>>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>>    6
>
> Thank you for the analysis and the report.
> I have fixed the regex-oriented problem with the attached
> patch, but not yet the case using -P -z (PCRE + --null-data):

FTR, I've also pushed the attached test-improving patch:

[0001-tests-convert-cmd-fail-1-to-returns_-1-cmd-fail-1.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 10:47:02 GMT) Full text and rfc822 format available.

Message #20 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Ulrich Mueller <ulm <at> gentoo.org>
To: Jim Meyering <jim <at> meyering.net>
Cc: Jim Meyering <meyering <at> fb.com>, Ulya Fokanova <skvadrik <at> gmail.com>,
 22655 <at> debbugs.gnu.org, Sergei Trofimovich <slyfox <at> gentoo.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sun, 21 Feb 2016 11:46:30 +0100

>>>>> On Sat, 20 Feb 2016, Jim Meyering wrote:

> I have fixed the regex-oriented problem with the attached patch [...]

I confirm that this patch (applied to sys-apps/grep-2.23 in Gentoo)
fixes the problem for me.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 15:37:02 GMT) Full text and rfc822 format available.

Message #23 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Ulrich Mueller <ulm <at> gentoo.org>
Cc: Jim Meyering <meyering <at> fb.com>, Ulya Fokanova <skvadrik <at> gmail.com>,
 22655 <at> debbugs.gnu.org, Sergei Trofimovich <slyfox <at> gentoo.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sun, 21 Feb 2016 07:36:27 -0800

[Message part 1 (text/plain, inline)]

On Sun, Feb 21, 2016 at 2:46 AM, Ulrich Mueller <ulm <at> gentoo.org> wrote:
>>>>>> On Sat, 20 Feb 2016, Jim Meyering wrote:
>
>> I have fixed the regex-oriented problem with the attached patch [...]
>
> I confirm that this patch (applied to sys-apps/grep-2.23 in Gentoo)
> fixes the problem for me.

Thanks for checking.
I reread the patch I pushed and was dismayed to see I'd left
in a debugging statement.  This additional patch removes
the offending line (harmless unless you have a writable
directory named /t):

[0001-tests-test-cleanup.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 16:36:02 GMT) Full text and rfc822 format available.

Message #26 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Ulya Fokanova <skvadrik <at> gmail.com>
Cc: Ulrich Mueller <ulm <at> gentoo.org>, 22655 <at> debbugs.gnu.org,
 Sergei Trofimovich <slyfox <at> gentoo.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sun, 21 Feb 2016 08:34:34 -0800

On Sat, Feb 20, 2016 at 8:19 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova <skvadrik <at> gmail.com> wrote:
>> I've explored the following case:
>>
>>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>>    6
...
>> The bug also present with PCRE engine:
>>
>>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>>    6
>>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>>    6
>
> Thank you for the analysis and the report.
> I have fixed the regex-oriented problem with the attached
> patch, but not yet the case using -P -z (PCRE + --null-data):

The -Pz/PCRE problem is more fundamental, and strikes
even with LC_ALL=C. This shows that with -Pz, anchors
still wrongly match at newlines, rather than at \0 bytes:

  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^a'
  [Exit 1]
  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^b'
  (standard input)

Fixing this is on PCRE's maint/README wish list with this item:

. Line endings:
  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 21 Feb 2016 20:07:01 GMT) Full text and rfc822 format available.

Message #29 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Ulya Fokanova <skvadrik <at> gmail.com>
Cc: Ulrich Mueller <ulm <at> gentoo.org>, 22655 <at> debbugs.gnu.org,
 Sergei Trofimovich <slyfox <at> gentoo.org>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Sun, 21 Feb 2016 12:06:28 -0800

[Message part 1 (text/plain, inline)]

Jim Meyering wrote:
> The -Pz/PCRE problem is more fundamental, and strikes
> even with LC_ALL=C.

grep should report an error when this problem occurs, rather than silently give 
incorrect answers. I installed the attached patch in a bold attempt to implement 
this; please feel free to revert if you think it goes too far.

[0001-grep-Pz-is-incompatible-with-and.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Mon, 22 Feb 2016 15:02:02 GMT) Full text and rfc822 format available.

Message #32 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Ulya Fokanova <skvadrik <at> gmail.com>, 22655 <at> debbugs.gnu.org,
 Ulrich Mueller <ulm <at> gentoo.org>, Sergei Trofimovich <slyfox <at> gentoo.org>
Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges
 work in an odd way (-P works fine)
Date: Mon, 22 Feb 2016 07:00:54 -0800

On Sun, Feb 21, 2016 at 12:06 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Jim Meyering wrote:
>>
>> The -Pz/PCRE problem is more fundamental, and strikes
>> even with LC_ALL=C.
>
> grep should report an error when this problem occurs, rather than silently
> give incorrect answers. I installed the attached patch in a bold attempt to
> implement this; please feel free to revert if you think it goes too far.

Thank you, Paul.
That suits me, and I found no fault with the patch.

These two bugs are a big enough deal that I now plan to
make another bug-fix-only release this week.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Wed, 23 Mar 2016 16:04:01 GMT) Full text and rfc822 format available.

Message #35 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22655 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Subject: [PATCH] grep: -Pz no longer misdiagnoses [^a]
Date: Wed, 23 Mar 2016 09:03:01 -0700

Problem reported by Michael Jess.
* NEWS: Document this.
* src/pcresearch.c (Pcompile): Do not diagnose [^ when [ is unescaped.
* tests/pcre: Test for the bug.
---
 NEWS             | 3 +++
 src/pcresearch.c | 8 +++++---
 tests/pcre       | 1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index 144f668..69e4a23 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,9 @@ GNU grep NEWS                                    -*- outline -*-
 
 ** Bug fixes
 
+  grep -Pz no longer mistakenly diagnoses patterns like [^a] that use
+  negated character classes. [bug introduced in grep-2.24]
+
   grep -oz now uses null bytes, not newlines, to terminate output lines.
   [bug introduced in grep-2.5]
 
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 3b8e795..f6e72b0 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -127,15 +127,17 @@ Pcompile (char const *pattern, size_t size)
   if (! eolbyte)
     {
       bool escaped = false;
+      bool after_unescaped_left_bracket = false;
       for (p = pattern; *p; p++)
         if (escaped)
-          escaped = false;
+          escaped = after_unescaped_left_bracket = false;
         else
           {
-            escaped = *p == '\\';
-            if (*p == '^' || *p == '$')
+            if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket))
               error (EXIT_TROUBLE, 0,
                      _("unescaped ^ or $ not supported with -Pz"));
+            escaped = *p == '\\';
+            after_unescaped_left_bracket = *p == '[';
           }
     }
 
diff --git a/tests/pcre b/tests/pcre
index b8b4662..8f3d9a4 100755
--- a/tests/pcre
+++ b/tests/pcre
@@ -15,5 +15,6 @@ fail=0
 echo | grep -P '\s*$' || fail=1
 echo | returns_ 2 grep -zP '\s$' || fail=1
 echo '.ab' | returns_ 1 grep -Pwx ab || fail=1
+echo x | grep -Pz '[^a]' || fail=1
 
 Exit $fail
-- 
2.5.5

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 11 Apr 2016 04:50:02 GMT) Full text and rfc822 format available.

Notification sent to Sergei Trofimovich <slyfox <at> gentoo.org>:
bug acknowledged by developer. (Mon, 11 Apr 2016 04:50:02 GMT) Full text and rfc822 format available.

Message #40 received at 22655-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22655-done <at> debbugs.gnu.org
Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd
 way (-P works fine)
Date: Sun, 10 Apr 2016 21:49:23 -0700

As this bug now appears to be fixed in grep master (or at least, as fixed as 
well as libpcre will allow), I'm closing it.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 09 May 2016 11:24:05 GMT) Full text and rfc822 format available.

bug unarchived. Request was from Stephane Chazelas <stephane.chazelas <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 18 Nov 2016 12:52:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 12:53:02 GMT) Full text and rfc822 format available.

Message #47 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: 22655 <at> debbugs.gnu.org
Subject: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 12:52:29 +0000

Hello,

I'm just finding out that ^ and $ no longer work with grep -Pz:
https://unix.stackexchange.com/questions/324263/grep-command-doesnt-support-and-anchors-when-its-with-pz

$ grep -Pz '^'
grep: unescaped ^ or $ not supported with -Pz

Which points to this bug.

Note that, it's not that pcre doesn't support NULL-delimited
records, it's that grep calls pcre with the wrong flag
(PCRE_MULTILINE) which is like the m flag in /.../m perl RE
operator which is explicitely to tell ^ to match at the
beginning of the subject *but also after every newline* (same
for $).

As already noted at
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=16871#8

printf 'a\nb\0' | grep -Pz '^b'

did match which was a bug indeed, but only because of that
PCRE_MULTILINE flag. If you turned off that flag:

printf 'a\nb\0' | grep -Pz '(?-m)^b'

Then it wouldn't match.

With grep 2.10:

$ printf 'a\nb\0c\0' | grep -Poz '^.'
a
b    # BUG
c
$ printf 'a\nb\0c\0' | grep -Poz '(?-m)^.'
a
c

Or use \A and \z in place of ^ and $ that match at the beginning
of the subject regardless of the state of the "m" flag:

$ printf 'a\nb\0c\0' | grep -Poz '\A.'
a
c

Now with the new version, we need to use those \A, \z. Or if we
want to match at the beginning of any of the lines in a NUL
delimited record, we need ugly things like:

grep -Pz '(?:\A|(?<=\n))'

instead of

grep -Pz '(?m)^'


Can that bug please be reopened so it can be addressed
differenly (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?

-- 
Stephane

Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 18 Nov 2016 12:54:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 15:55:02 GMT) Full text and rfc822 format available.

Message #52 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>, 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 07:54:29 -0800

Stephane Chazelas wrote:
> Can that bug please be reopened so it can be addressed
> differenly (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?

Removing PCRE_MULTILINE will make 'grep' waaaay slower. Can you think of a way 
of fixing the bug that doesn't involve removing PCRE_MULTILINE?

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 16:25:01 GMT) Full text and rfc822 format available.

Message #55 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 16:24:02 +0000

2016-11-18 07:54:29 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >Can that bug please be reopened so it can be addressed
> >differenly (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?
> 
> Removing PCRE_MULTILINE will make 'grep' waaaay slower. Can you
> think of a way of fixing the bug that doesn't involve removing
> PCRE_MULTILINE?

Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
some overhead. It is really meant to be only about changing the
behaviour of ^ and $.

Not-PCRE_MULTILINE is the default about everywhere else
(including less, pcregrep, ssed -R).

Do you have any particular test-case in mind?

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 16:39:02 GMT) Full text and rfc822 format available.

Message #58 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 16:37:54 +0000

2016-11-18 16:24:02 +0000, Stephane Chazelas:
[...]
> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> some overhead. It is really meant to be only about changing the
> behaviour of ^ and $.
[...]

Unsurprisingly,

where ~/a contains the output of find / -print0 (which is
typically what grep -z is used on) 304MiB, 3696401 records,
none of which contain a newline character in this instance.
with grep 2.10 (Ubuntu 12.04 amd64) in a UTF-8 locale:

$ time grep -Pz '(?-m)^/' ~/a > /dev/null
grep -Pz '(?-m)^/' ~/a > /dev/null  0.84s user 0.15s system 99% cpu 0.990 total
$ time grep -Pz '^/' ~/a > /dev/null
grep -Pz '^/' ~/a > /dev/null  0.87s user 0.12s system 99% cpu 0.989 total

Not much difference as "/" is found at the beginning of each
record.

$ time grep -Pz '(?-m)^x' ~/a > /dev/null
grep -Pz '(?-m)^x' ~/a > /dev/null  0.41s user 0.06s system 99% cpu 0.473 total
$ time grep -Pz '^x' ~/a > /dev/null
grep -Pz '^x' ~/a > /dev/null  0.81s user 0.04s system 99% cpu 0.854 total

PCRE_MULTILINE significantly slows things down as even though
"x" is not found at the beginning of the subject, grep still
needs to look for extra newline characters in the record.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 16:49:02 GMT) Full text and rfc822 format available.

Message #61 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 08:48:04 -0800

Stephane Chazelas wrote:
> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> some overhead.

As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an entire 
buffer that contains many lines, and this lets PCRE efficiently find the first 
match in the whole buffer. If grep doesn't use PCRE_MULTILINE, grep would have 
to apply the pattern to each line separately, which could be significantly slower.

That being said, PCRE matching is pretty slow already, so perhaps we shouldn't 
worry about efficiency here.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 17:07:01 GMT) Full text and rfc822 format available.

Message #64 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 17:06:36 +0000

2016-11-18 08:48:04 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> >some overhead.
> 
> As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an
> entire buffer that contains many lines, and this lets PCRE
> efficiently find the first match in the whole buffer. If grep
> doesn't use PCRE_MULTILINE, grep would have to apply the pattern to
> each line separately, which could be significantly slower.
[...]

That might have been the case a long time ago, as I remember
some discussion about it as it explained some wrong information
in the documentation, but as far as I and gdb can tell, grep
2.26 at least call pcre_exec for every line of the input with
grep -P.

If it didn't

echo test | grep -P '\n$'

would match.

I'll try and dig up the old discussions.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 17:34:02 GMT) Full text and rfc822 format available.

Message #67 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 17:33:00 +0000

2016-11-18 17:06:36 +0000, Stephane Chazelas:
[...]
> That might have been the case a long time ago, as I remember
> some discussion about it as it explained some wrong information
> in the documentation, but as far as I and gdb can tell, grep
> 2.26 at least call pcre_exec for every line of the input with
> grep -P.
> 
> If it didn't
> 
> echo test | grep -P '\n$'
> 
> would match.
> 
> I'll try and dig up the old discussions.
[...]

I can't find the discussions, and they should have been a
follow-up on https://debbugs.gnu.org/cgi/bugreport.cgi?bug=16871
but looking at the code, that changed in commit
a14685c2833f7c28a427fecfaf146e0a861d94ba in 2010 for
https://savannah.gnu.org/bugs/?27460

And shortly after, -Pz was supported (as that was made
straightforward to do  after the change I guess).

Note that we should also be able to support grep -Pe pat1 -e
pat2 now as well.

And PCRE_MULTILINE now can only improve performance.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 17:47:01 GMT) Full text and rfc822 format available.

Message #70 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 17:46:46 +0000

2016-11-18 17:32:59 +0000, Stephane Chazelas:
[...]
> And PCRE_MULTILINE now can only improve performance.
[...]

Sorry, I meant "and *removing* PCRE_MULTILINE now can only
improve performance". (and we need to add PCRE_DOLLAR_ENDONLY or
otherwise printf 'a\n\0' | grep -Pz 'a$' would still match).

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 17:48:02 GMT) Full text and rfc822 format available.

Message #73 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 09:47:50 -0800

Stephane Chazelas wrote:
> $ time grep -Pz '(?-m)^/' ~/a > /dev/null

It looks like you want "^" to stand for a newline character, not the start of a 
line. That is not how grep -z works. -z causes the null byte to be the line 
delimiter, and "^" should stand for a position immediately after a null byte (or 
at start of file).

It might be nice to have a syntax for matching a newline byte with -z (or a null 
byte without -z, for that matter). But that would be a new feature.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 18:08:01 GMT) Full text and rfc822 format available.

Message #76 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 18:07:51 +0000

2016-11-18 09:47:50 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >$ time grep -Pz '(?-m)^/' ~/a > /dev/null
> 
> It looks like you want "^" to stand for a newline character, not the
> start of a line. That is not how grep -z works. -z causes the null
> byte to be the line delimiter, and "^" should stand for a position
> immediately after a null byte (or at start of file).
[...]

No, sorry if I wasn't very clear, that's the other way round and
it's the whole point of this discussion.

grep had a bug in that it was calling pcre_exec on the content
of each null delimited record with a regex compiled with
PCRE_MULTILINE

That caused

printf 'a\nb\0' | grep -zP '^b'

to match even though the record doesn't start with a "b".

To work around it, you have to disable the PCRE_MULTILINE flag
in the regexp syntax with the (?-m) PCRE operator, or use \A
instead of ^.

The problem was /fixed/ (and I'm arguing here it's the wrong fix),
by disallowing ^ with -Pz while the obvious fix is to remove
that PCRE_MULTILINE flag.

As it turns out PCRE_MULTILINE is there because in the old days,
before grep -Pz was supported, with grep -P (without -z), grep
would pass more than one line to pcre_exec. If you look at the
grep bug history, 90% of the grep pcre related bugs were caused
by that.

It was fixed/changed in
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=a14685c2833f7c28a427fecfaf146e0a861d94ba
but Paolo forgot to remove the PCRE_MULTILINE flag when the code
was changed to pass one line at a time to pcre_exec and
PCRE_MULTILINE was no longer needed anymore (and later called
problem when grep -Pz was supported).

> It might be nice to have a syntax for matching a newline byte with
> -z (or a null byte without -z, for that matter). But that would be a
> new feature.

That feature is already there. That's the (?m) PCRE operator.

That's the whole point. That m flag (PCRE_MULTILINE) is on by
default in GNU grep, and that's what it's causing all the
problems.

Once you turn it off *by default*, that makes ^ match the
beginning of the NUL-delimited record as it should and one can
use (?m) if he wants ^ to match the beginning of each line in
the NUL-delimited record instead of just the beginning of the
record.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Fri, 18 Nov 2016 23:38:02 GMT) Full text and rfc822 format available.

Message #79 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Fri, 18 Nov 2016 15:37:16 -0800

Stephane Chazelas wrote:
> 2016-11-18 08:48:04 -0800, Paul Eggert:
>> Stephane Chazelas wrote:
>>> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
>>> some overhead.
>>
>> As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an
>> entire buffer that contains many lines, and this lets PCRE
>> efficiently find the first match in the whole buffer. If grep
>> doesn't use PCRE_MULTILINE, grep would have to apply the pattern to
>> each line separately, which could be significantly slower.
> [...]
>
> That might have been the case a long time ago, as I remember
> some discussion about it as it explained some wrong information
> in the documentation, but as far as I and gdb can tell, grep
> 2.26 at least call pcre_exec for every line of the input with
> grep -P.
>

Although that was true starting with commit 
a14685c2833f7c28a427fecfaf146e0a861d94ba (2010-03-04), it became false starting 
with commit 9fa500407137f49f6edc3c6b4ee6c7096f0190c5 (2014-09-16).

> If it didn't
>
> echo test | grep -P '\n$'
>
> would match.

No, because grep omits the trailing newline in that particular input. And for 
this example:

printf 'test\n\n' | grep -p '\n$'

grep passes "test\n" to jit_exec, determines that jit_exec returns a match that 
crosses a line boundary, and rejects the match.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 08:10:02 GMT) Full text and rfc822 format available.

Message #82 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 00:09:11 -0800

Stephane Chazelas wrote:
> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> some overhead.

After looking into it I now remember why PCRE_MULTILINE speeds things up. See:

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef

where using PCRE_MULTILINE sped up 'grep -P "z.*a"' by a factor of 220.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 08:37:02 GMT) Full text and rfc822 format available.

Message #85 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 00:36:12 -0800

[Message part 1 (text/plain, inline)]

Stephane Chazelas wrote:
> one can
> use (?m) if he wants ^ to match the beginning of each line in
> the NUL-delimited record instead of just the beginning of the
> record.

I think the intent is that ^ and $ should match only the line-terminator 
specified by -z (or by -z's absence). So the sort of usage you describe is 
unspecified and not supported. That being said, it does make sense to match 
tricky regular expressions like that line by line, even if this hurts 
performance. Otherwise, I suspect there are even trickier regular expressions 
that could reject a buffer full of lines even though it contains matching lines. 
When in doubt we should avoid optimization so I installed the attached patch 
into the master branch. Please give it a try.

[0001-grep-Pz-no-longer-rejects.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 09:23:01 GMT) Full text and rfc822 format available.

Message #88 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org, Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 03:22:07 -0600

On Sat, Nov 19, 2016 at 12:36:12AM -0800, Paul Eggert wrote:
>Stephane Chazelas wrote:
>>one can
>>use (?m) if he wants ^ to match the beginning of each line in
>>the NUL-delimited record instead of just the beginning of the
>>record.
>
>I think the intent is that ^ and $ should match only the 
>line-terminator specified by -z (or by -z's absence). So the sort of 
>usage you describe is unspecified and not supported. That being said, 
>it does make sense to match tricky regular expressions like that line 
>by line, even if this hurts performance. Otherwise, I suspect there 
>are even trickier regular expressions that could reject a buffer full 
>of lines even though it contains matching lines. When in doubt we 
>should avoid optimization so I installed the attached patch into the 
>master branch. Please give it a try.

>From 0e00fe0fc34184b1cdcea92a671eb9ffebb4899b Mon Sep 17 00:00:00 2001
>From: Paul Eggert <eggert <at> cs.ucla.edu>
>Date: Sat, 19 Nov 2016 00:25:46 -0800
>Subject: [PATCH] grep: -Pz no longer rejects ^, $
>
>Problem reported by Stephane Chazelas (Bug#22655).
>* NEWS: Document this.
>* doc/grep.texi (grep Programs): Warn about -Pz.
>* src/pcresearch.c (reflags): New static var.
>(multibyte_locale): Remove static var; now local to Pcompile.
>(Pcompile): Check for (? and (* too.  Set reflags instead of
>dying when problematic operators are found.
>(Pexecute): Use reflags to decide whether searches should
>be multiline.
>* tests/pcre: Test new behavior.

I'm a bit confused by this patch -- I see 'reflags' being tested in 
Pexecute(), but I don't see it getting set anywhere, just Pcompile()'s 
local 'flags'...I'm guessing 'flags' was supposed to be replaced by 
'reflags'?  (Not entirely certain though.)


Zev

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 09:38:02 GMT) Full text and rfc822 format available.

Message #91 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 22655 <at> debbugs.gnu.org, Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 01:37:18 -0800

[Message part 1 (text/plain, inline)]

Zev Weiss wrote:
> I see 'reflags' being tested in Pexecute(), but I don't see it getting set anywhere

Oops, that somehow got lost during merging. Thanks for catching that. Omitting 
the initialization caused hurt performance due to a failure to use 
PCRE_MULTILINE but did not cause a correctness bug, so the tests didn't catch 
it. Fixed with the attached patch.

[0001-grep-fix-performance-typo-with-P.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 10:42:01 GMT) Full text and rfc822 format available.

Message #94 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 10:41:27 +0000

2016-11-18 15:37:16 -0800, Paul Eggert:
[...]
> >That might have been the case a long time ago, as I remember
> >some discussion about it as it explained some wrong information
> >in the documentation, but as far as I and gdb can tell, grep
> >2.26 at least call pcre_exec for every line of the input with
> >grep -P.
> >
> 
> Although that was true starting with commit
> a14685c2833f7c28a427fecfaf146e0a861d94ba (2010-03-04), it became
> false starting with commit 9fa500407137f49f6edc3c6b4ee6c7096f0190c5
> (2014-09-16).
[...]

OK, it looks like I don't have the full story, and my multiple
calls to pcre_exec() seems to point to something else:

$ seq 10 | ltrace  -e '*pcre*' ./src/grep -P .
grep->pcre_maketables(0x221e2f0, 0x221e240, 1, 2)                                                                     = 0x221e310
grep->pcre_compile(0x221e2f0, 2050, 0x7ffe943ec6f8, 0x7ffe943ec6f4)                                                   = 0x221e760
grep->pcre_study(0x221e760, 1, 0x7ffe943ec6f8, 0x7ffe943eb490)                                                        = 0x221e7b0
grep->pcre_fullinfo(0x221e760, 0x221e7b0, 16, 0x7ffe943ec6f4)                                                         = 0
grep->pcre_exec(0x221e760, 0x221e7b0, "", 0, 0, 128, 0x7ffe943ec700, 300)                                             = -1
grep->pcre_exec(0x221e760, 0x221e7b0, "", 0, 0, 0, 0x7ffe943ec700, 300)                                               = -1
grep->pcre_exec(0x221e760, 0x221e7b0, "1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n", 20, 0, 8192, 0x7ffe943ec4e0, 300)            = 1
1
grep->pcre_exec(0x221e760, 0x221e7b0, "2\n3\n4\n5\n6\n7\n8\n9\n10\n", 18, 0, 8192, 0x7ffe943ec4e0, 300)               = 1
2
grep->pcre_exec(0x221e760, 0x221e7b0, "3\n4\n5\n6\n7\n8\n9\n10\n", 16, 0, 8192, 0x7ffe943ec4e0, 300)                  = 1
3
grep->pcre_exec(0x221e760, 0x221e7b0, "4\n5\n6\n7\n8\n9\n10\n", 14, 0, 8192, 0x7ffe943ec4e0, 300)                     = 1
4
grep->pcre_exec(0x221e760, 0x221e7b0, "5\n6\n7\n8\n9\n10\n", 12, 0, 8192, 0x7ffe943ec4e0, 300)                        = 1
5
grep->pcre_exec(0x221e760, 0x221e7b0, "6\n7\n8\n9\n10\n", 10, 0, 8192, 0x7ffe943ec4e0, 300)                           = 1
6
grep->pcre_exec(0x221e760, 0x221e7b0, "7\n8\n9\n10\n", 8, 0, 8192, 0x7ffe943ec4e0, 300)                               = 1
7
grep->pcre_exec(0x221e760, 0x221e7b0, "8\n9\n10\n", 6, 0, 8192, 0x7ffe943ec4e0, 300)                                  = 1
8
grep->pcre_exec(0x221e760, 0x221e7b0, "9\n10\n", 4, 0, 8192, 0x7ffe943ec4e0, 300)                                     = 1
9
grep->pcre_exec(0x221e760, 0x221e7b0, "10\n", 2, 0, 8192, 0x7ffe943ec4e0, 300)                                        = 1
10
+++ exited (status 0) +++

I don't know the details of why it's done that way, but I'm not
sure I can see how calling pcre_exec that way can be quicker
than calling it on each individual line/record.


Note that this is still wrong:

$ printf 'a\nb\0' | ./src/grep -zxP a
a
b

Removing PCRE_MULTILINE (and get back to calling pcre_exec on
every record separately) would help except in the cases where the
user does:

grep -xzP '(?m)a'

You'd want to change:

  static char const xprefix[] = "^(?:";
  static char const xsuffix[] = ")$";

To:

  static char const xprefix[] = "\A(?:";
  static char const xsuffix[] = ")\z";


-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 11:23:02 GMT) Full text and rfc822 format available.

Message #97 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 03:22:23 -0800

[Message part 1 (text/plain, inline)]

Stephane Chazelas wrote:

> I don't know the details of why it's done that way, but I'm not
> sure I can see how calling pcre_exec that way can be quicker
> than calling it on each individual line/record.

It can be hundreds of times faster in common cases. See:

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef

> Note that this is still wrong:
>
> $ printf 'a\nb\0' | ./src/grep -zxP a
> a
> b

Thanks, fixed by installing the attached.

> Removing PCRE_MULTILINE (and get back to calling pcre_exec on
> every record separately) would help except in the cases where the
> user does:
>
> grep -xzP '(?m)a'

I don't think grep can address this problem, as in general that would require 
interpreting the PCRE pattern at run-time and grep should not be delving into 
PCRE internals. Uses of (?m) lead to unspecified behavior in grep, and 
applications should not rely on any particular behavior in this area. This is 
firmly in the Perl tradition, as the Perl documentation for this part of the 
regular expression syntax says "The stability of these extensions varies widely. 
Some ... are experimental and may change without warning or be completely 
removed." Also, the grep manual says that -P "is highly experimental". User 
beware, that's all.

[0001-grep-fix-zxP-bug.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 14:39:01 GMT) Full text and rfc822 format available.

Message #100 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Aaron Crane <grep <at> aaroncrane.co.uk>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org, Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 14:37:59 +0000

Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Stephane Chazelas wrote:
>> Removing PCRE_MULTILINE (and get back to calling pcre_exec on
>> every record separately) would help except in the cases where the
>> user does:
>>
>> grep -xzP '(?m)a'
>
> I don't think grep can address this problem, as in general that would
> require interpreting the PCRE pattern at run-time and grep should not be
> delving into PCRE internals. Uses of (?m) lead to unspecified behavior in
> grep, and applications should not rely on any particular behavior in this
> area. This is firmly in the Perl tradition, as the Perl documentation for
> this part of the regular expression syntax says "The stability of these
> extensions varies widely. Some ... are experimental and may change without
> warning or be completely removed." Also, the grep manual says that -P "is
> highly experimental". User beware, that's all.

I believe the sense of "stability" in that line of the Perl
documentation is "expected continued availability of that feature in
future versions of Perl". But most of the constructs covered by that
caveat have been unchanged (except for bug fixes) for well over a
decade; and some of them, including (?m) and friends, date to Perl
5.000, released in 1994. I can therefore say with a high degree of
confidence that they aren't going to be removed or changed in future
versions of Perl 5, and I've just removed that outdated notice from
the Perl documentation:

https://perl5.git.perl.org/perl.git/commitdiff/ff8bb4687895e07f822f5227d573c967aa0a4524

(That change should be part of the next stable release of Perl, namely
5.26, expected in May 2017.)

Beyond that, I'm not sure it's ideal to use the Perl documentation as
a comprehensive guide to PCRE: the two implementations are
independent, with slightly differing sets of features, and with known
differences in behaviour on edge cases even for features they share.
So even if Perl did consider constructs like (?m) to be "unstable" or
"unreliable", PCRE's authors and maintainers might take a different
view.

That said, I do accept that there are limits to what grep can
reasonably do here — I certainly wouldn't advocate that it try to
delve into PCRE internals! — and in particular, I recognise that -P is
documented as experimental.

-- 
Aaron Crane ** http://aaroncrane.co.uk/

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 16:15:01 GMT) Full text and rfc822 format available.

Message #103 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 16:14:28 +0000

2016-11-19 03:22:23 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> 
> >I don't know the details of why it's done that way, but I'm not
> >sure I can see how calling pcre_exec that way can be quicker
> >than calling it on each individual line/record.
> 
> It can be hundreds of times faster in common cases. See:
> 
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef

On that same sample, when comparing with pcregrep, an
implementation which does the right thing IMO and is what grep
is documented to do, that is match each line against the regexp
passed as argument and without PCRE_MULTILINE

I don't find a x220 factor, more like a x2.5 factor:

$ time pcregrep  "z.*a" k
pcregrep "z.*a" k  1.00s user 0.05s system 99% cpu 1.047 total
$ time ./grep -P  "z.*a" k
./grep -P "z.*a" k  0.41s user 0.05s system 99% cpu 0.457 total

On the other hand if you change the pattern to "z[^+]*a",
pcregrep still takes about one second, but GNU grep a lot longer
(I gave up after a few minutes).

With a simpler example:

$ time seq 10000 | ./grep -P '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.002 total
./grep -P '[^a]*1000[^a]*2000'  1.99s user 0.00s system 99% cpu 1.991 total
$ time seq 10000 | pcregrep '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.001 total
pcregrep '[^a]*1000[^a]*2000'  0.00s user 0.00s system 0% cpu 0.002 total

That's at least a x1000 factor.

If I understand correctly, you're calling pcre_exec on several
lines and with PCRE_MULTILINE as an optimisation in the cases
where the RE is unlikely to match, so one call to pcre_exec can
go through many lines at a time, and if it doesn't match, that's
as many lines you can safely skip.

The example above is an one that shows it's an invalid
optimisation. seq 10000 obviously has no line that matches that
pattern, but still because GNU grep tries to match it on a
string that contains multiple lines, it will match in several
places, and you need to call pcre_exec again to double-check
each of those false positives which is counter-productive.

AFAICT, that optimisation only brings a little optimisation in
some cases (and reduces performance in others) but also reduces
functionality and breaks users' expectations. IMO, it's not
worth it.

[...]
> >grep -xzP '(?m)a'
> 
> I don't think grep can address this problem, as in general that
> would require interpreting the PCRE pattern at run-time and grep
> should not be delving into PCRE internals. Uses of (?m) lead to
> unspecified behavior in grep, and applications should not rely on
> any particular behavior in this area. 

I agree, grep should not try and outsmart libpcre, it should
just call it on every line (or NUL delimited record with -z), so
that users can expect the regexp it passes to grep to be matched
on those lines/records.

> This is firmly in the Perl
> tradition, as the Perl documentation for this part of the regular
> expression syntax says "The stability of these extensions varies
> widely. Some ... are experimental and may change without warning or
> be completely removed."

No, those s, m, i... flags are at the core of perl regexp
matching, they are certainly not experimental.

Most of the ones that are "experimental" in perl are not
available in PCRE.

> Also, the grep manual says that -P "is
> highly experimental". User beware, that's all.

It's highly experimental because grep is not using PCRE the
straightforward way. The simple "grep -P re" could be
implemented in a few lines of code (the equivalent of a
fgets+pcre_exec loop).

PCRE is the modern de-facto regular expression standard. Most
modern languages have regexps that are more or less compatible
with those (when they don't link to libpcre). If you look at the
trends on stackoverflow.com or unix.stackexchange.com, you'll
notice that nowadays, GNU grep is more often called with -P than
not.

I don't think you can just dismiss it as "not GNU grep's core
functionality".

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 16:46:01 GMT) Full text and rfc822 format available.

Message #106 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 16:45:29 +0000

2016-11-19 16:14:28 +0000, Stephane Chazelas:
[...]
> AFAICT, that optimisation only brings a little optimisation in
> some cases (and reduces performance in others) but also reduces
> functionality and breaks users' expectations. IMO, it's not
> worth it.
[...]

To be fair, that last part about users' expectations is highly
reduced with your recent changes. Thanks for that and for
re-enabling grep -Pz ^/$

I just think that if grep always did the simplest thing of
calling pcre_exec on every record in all the cases,
that would make for simpler, more easily maintainable and less
"experimental" code, and probably wouldn't affect performance so
badly on average.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 21:53:02 GMT) Full text and rfc822 format available.

Message #109 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 21:52:35 +0000

2016-11-19 16:45:29 +0000, Stephane Chazelas:
> 2016-11-19 16:14:28 +0000, Stephane Chazelas:
> [...]
> > AFAICT, that optimisation only brings a little optimisation in
> > some cases (and reduces performance in others) but also reduces
> > functionality and breaks users' expectations. IMO, it's not
> > worth it.
> [...]
> 
> To be fair, that last part about users' expectations is highly
> reduced with your recent changes. Thanks for that and for
> re-enabling grep -Pz ^/$
[...]

A few cases left:

$ printf 'a\nb\n' | ./grep -P 'a\z'
$ printf 'a\nb\n' | pcregrep 'a\z'
a

Note that it's https://savannah.gnu.org/bugs/?27460 already
mentioned whose fix at the time was to stop passing several
lines to pcre_exec(). (A regression test could have helped spot
this one.)

$ printf 'a\nb\n' | ./grep -P 'a[^b]*+$'
$ printf 'a\nb\n' | pcregrep 'a[^b]*+$'
a

*+ is the non-backtracking * (same for other repeating
operators). The example is a bit contrieved, but that hopefully
illustrates that any extended PCRE operator, current or future
could fool that GNU grep optimisation.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sat, 19 Nov 2016 23:13:01 GMT) Full text and rfc822 format available.

Message #112 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aaron Crane <grep <at> aaroncrane.co.uk>
Cc: 22655 <at> debbugs.gnu.org, Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 15:12:41 -0800

Aaron Crane wrote:
> I'm not sure it's ideal to use the Perl documentation as
> a comprehensive guide to PCRE:

Unfortunately libpcre does not document the regular expression syntax it 
supports. Neither does the grep manual, for 'grep -P'. And as you've mentioned, 
the Perl manual isn't a reliable source for libpcre either. So users of 'grep 
-P' cannot rely on any documentation for behavior; it's an unwritten tradition 
instead.

Long ago when a friend told me "When you're telling someone else where you plan 
to throw the Frisbee, don't say anything more specific than 'Watch this!'" 
That's 'grep -P' in a nutshell.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 20 Nov 2016 07:15:01 GMT) Full text and rfc822 format available.

Message #115 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655 <at> debbugs.gnu.org, Aaron Crane <grep <at> aaroncrane.co.uk>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sun, 20 Nov 2016 07:14:28 +0000

2016-11-19 15:12:41 -0800, Paul Eggert:
> Aaron Crane wrote:
> >I'm not sure it's ideal to use the Perl documentation as
> >a comprehensive guide to PCRE:
> 
> Unfortunately libpcre does not document the regular expression
> syntax it supports. Neither does the grep manual, for 'grep -P'. And
> as you've mentioned, the Perl manual isn't a reliable source for
> libpcre either. So users of 'grep -P' cannot rely on any
> documentation for behavior; it's an unwritten tradition instead.
[...]

You've missed "man pcrepattern". See also "man pcresyntax" for
the cheat sheet and "man pcre" for the index of all the PCRE man
pages.

If that's a consolation, I also had been using PCRE years before
I discovered that man page (which I find of very high quality
btw) few years ago.

-- 
Stephane

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 20 Nov 2016 07:17:01 GMT) Full text and rfc822 format available.

Message #118 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org, Aaron Crane <grep <at> aaroncrane.co.uk>
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 23:16:17 -0800

Stephane Chazelas wrote:
> You've missed "man pcrepattern".

Yes I did. Thanks. Google was not my friend there.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 20 Nov 2016 07:58:01 GMT) Full text and rfc822 format available.

Message #121 received at 22655 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655 <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sat, 19 Nov 2016 23:57:22 -0800

[Message part 1 (text/plain, inline)]

Stephane Chazelas wrote:
> I don't find a x220 factor, more like a x2.5 factor:

I think I found the factor-of-hundreds slowdown, and fixed it in the 2nd 
attached patch.

When I tried your benchmark with pcregrep (pcre 8.39, configured with 
--enable-unicode-properties), and with ./grep0 (which has the PCRE_MULTILINE 
implementation, i.e., commit da94c91a81fc63275371d0580d8688b6abd85346), and with 
./grep (which is grep after the attached patches are installed), I got timings 
like the following:

    user  sys
    1.972 0.072 LC_ALL=en_US.utf8 pcregrep -u "z.*a" k
    0.234 0.076 LC_ALL=en_US.utf8 ./grep0 -P "z.*a" k
    1.280 0.064 LC_ALL=en_US.utf8 ./grep -P "z.*a" k
    1.487 0.077 LC_ALL=C pcregrep "z.*a" k
    0.193 0.067 LC_ALL=C ./grep0 -P "z.*a" k
    0.825 0.096 LC_ALL=C ./grep -P "z.*a" k

All times are CPU seconds. This is Fedora 24 x86-64, AMD Phenom II X4 910e. As 
before, k was created by the shell command: yes 'abcdefg hijklmn opqrstu vwxyz' 
| head -n 10000000 >k

So, on this benchmark using PCRE_MULTILINE gave a speedup of a factor of ~4.3 in 
a multibyte locale, and a speedup of ~3.5 in a unibyte locale.

> On the other hand if you change the pattern to "z[^+]*a",
> pcregrep still takes about one second, but GNU grep a lot longer

Yes, that example makes GNU grep -P look really bad. So installed the 1st 
attached patch, which mostly just reverts the January multiline patch, i.e., it 
goes back to the slower "./grep -P" lines measured above.

[0001-grep-P-no-longer-uses-PCRE_MULTILINE.patch (text/x-diff, attachment)]

[0002-grep-further-P-performance-fix.patch (text/x-diff, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 20 Nov 2016 08:04:02 GMT) Full text and rfc822 format available.

Notification sent to Sergei Trofimovich <slyfox <at> gentoo.org>:
bug acknowledged by developer. (Sun, 20 Nov 2016 08:04:02 GMT) Full text and rfc822 format available.

Message #126 received at 22655-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 22655-done <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sun, 20 Nov 2016 00:03:26 -0800

As all the bugs in this bug report appear to be fixed, I'm closing it now.

Information forwarded to bug-grep <at> gnu.org:
bug#22655; Package grep. (Sun, 20 Nov 2016 09:17:02 GMT) Full text and rfc822 format available.

Message #129 received at 22655-done <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 22655-done <at> debbugs.gnu.org
Subject: Re: bug#22655: grep -Pz '^' now fails!
Date: Sun, 20 Nov 2016 09:16:09 +0000

2016-11-20 00:03:26 -0800, Paul Eggert:
> As all the bugs in this bug report appear to be fixed, I'm closing it now.

Thanks.

BTW, I was wrong about

  static char const xprefix[] = "^(?:";
  static char const xsuffix[] = ")$";

(as opposed to \A, \z) causing a problem with grep -Pxz '(?m)...'

as the pattern becomes ^(?:(?m)...)$, so the m flag is only
applied within the (?:...). I can see pcre_compile() supports a
PCRE_ANCHORED flag, but it looks like it only makes ^ implicit
(not $).

-- 
Stephane

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 18 Dec 2016 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 206 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #22655 grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine)

GNU bug report logs - #22655
grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine)