GNU bug report logs - #16865
grep -wP and backreferences

Package: grep

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Mon, 24 Feb 2014 16:31:02 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16865 in the body.
You can then email your comments to 16865 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Mon, 24 Feb 2014 16:31:02 GMT)

Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 24 Feb 2014 16:31:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: grep -wP and backreferences
Date: Mon, 24 Feb 2014 10:01:54 +0000
Hello,

Backreferences don't work with -w or -x in combination with -P:

$ echo aa | grep -Pw '(.)\1'
$

Or they work in an unexpected way:

$ echo aa | grep -Pw '(.)\2'
aa

The fix is simple:


--- src/pcresearch.c~	2014-02-24 09:59:56.864374362 +0000
+++ src/pcresearch.c	2014-02-24 07:33:04.666398105 +0000
@@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si
 
   *n = '\0';
   if (match_lines)
-    strcpy (n, "^(");
+    strcpy (n, "^(?:");
   if (match_words)
-    strcpy (n, "\\b(");
+    strcpy (n, "\\b(?:");
   n += strlen (n);
 
   /* The PCRE interface doesn't allow NUL bytes in the pattern, so
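
The renumbering behind both symptoms can be reproduced by applying the old and new wrappers by hand, without -w. This is only a sketch: it assumes a grep built with PCRE support (-P), and the `try` helper is hypothetical, introduced here purely for illustration.

```shell
# Hypothetical helper: does PCRE pattern $1 match the line "aa"?
try() {
  if printf 'aa\n' | grep -qP -- "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

# Old wrapper, \b( ... )\b: the added group becomes group 1,
# so the user's (.) is silently renumbered to group 2.
try '\b((.)\1)\b'    # no match: \1 points at the still-open outer group
try '\b((.)\2)\b'    # match:    \2 happens to point at the user's (.)

# New wrapper, \b(?: ... )\b: non-capturing, so numbering is untouched.
try '\b(?:(.)\1)\b'  # match
```

Writing the wrapper explicitly keeps the result independent of how a given grep version implements -w.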




Information forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Mon, 24 Feb 2014 20:01:02 GMT)

Message #8 received at 16865 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 16865 <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Mon, 24 Feb 2014 12:00:08 -0800
On Mon, Feb 24, 2014 at 2:01 AM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> Hello,
>
> Backreferences don't work with -w or -x in combination with -P:
>
> $ echo aa | grep -Pw '(.)\1'
> $
>
> Or they work in an unexpected way:
>
> $ echo aa | grep -Pw '(.)\2'
> aa
>
> The fix is simple:
>
>
> --- src/pcresearch.c~   2014-02-24 09:59:56.864374362 +0000
> +++ src/pcresearch.c    2014-02-24 07:33:04.666398105 +0000
> @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si

Thanks a lot for the patch.
I've converted it to a proper commit with NEWS and a test case.
Please ack the attached if it's all ok with you (you're still the "Author:"):
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Mon, 24 Feb 2014 21:21:03 GMT)

Message #11 received at 16865 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16865 <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Mon, 24 Feb 2014 21:20:01 +0000
Fine by me, thanks.

BTW, as discussed in another bug, -w and -x invalidate
(*UCP) and other PCRE special sequences. Chances are we can't
easily do much about it, but it may still be worth documenting.

For example, one should use

grep -P '(*UCP)\bword\b'

because

grep -wP '(*UCP)word'

won't work (pcregrep has the same problem).
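
The cause is positional: PCRE recognises (*UCP) and similar option-setting sequences only at the very start of a pattern, and the -w/-x wrapper is placed in front of them. The sketch below just prints the wrapped string. The `wrap_w` helper is hypothetical, and its closing `)\b` is an assumption inferred from the \b(?:...)\b form mentioned in this thread.

```shell
# Hypothetical helper mimicking how -w wraps the user's pattern.
wrap_w() {
  printf '%s\n' "\\b(?:$1)\\b"
}

wrap_w '(*UCP)word'
# prints: \b(?:(*UCP)word)\b
# (*UCP) now sits inside the group instead of at the start of the pattern.
```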

In another bug, I've seen someone commenting that

grep -wP 'a)(b'

doesn't give the error message that one would expect (not that
I'd expect anyone would care).
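
The missing diagnostic follows from the same wrapping: once the pattern sits inside a group, its stray parentheses pair up with the wrapper's. A sketch, assuming a grep with -P support and using hand-wrapped patterns rather than -w so the result does not depend on the grep version (`status_of` is a hypothetical helper):

```shell
# Hypothetical helper: print grep's exit status for PCRE pattern $1
# when searching the single line "ab" (0 = match, 1 = no match, 2 = error).
status_of() {
  st=0
  printf 'ab\n' | grep -qP -- "$1" 2>/dev/null || st=$?
  echo "$st"
}

status_of 'a)(b'          # 2: on its own, the unbalanced ")" is rejected
status_of '\b(?:a)(b)\b'  # 0: wrapped, it reads as (?:a) followed by (b)
```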

A last note: with -w, pcregrep wraps the regexp in \b...\b
instead of \b(?:...)\b, so it could be that those brackets are
not necessary in the first place.

Sorry I lied, it was not the last note ;-). Note the difference:

$ echo a@@b | grep -w @@
$ echo a@@b | grep -Pw @@
a@@b


Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)

$ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
$ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
%aa%
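
The two wrappers can be compared side by side. This is a sketch assuming a grep with -P support; `matches` is a hypothetical helper, and the patterns are wrapped by hand so the outcome does not depend on how a given grep implements -w.

```shell
# Hypothetical helper: does PCRE pattern $1 match the line $2?
matches() {
  if printf '%s\n' "$2" | grep -qP -- "$1"; then echo yes; else echo no; fi
}

matches '\b(?:@@)\b'          'a@@b'  # yes: \b holds between "a" and "@"
matches '(?<!\w)(?:@@)(?!\w)' 'a@@b'  # no:  "@@" touches word characters
matches '\b(?:@@)\b'          '@@'    # no:  no word character, so no \b anywhere
matches '(?<!\w)(?:@@)(?!\w)' '@@'    # yes: nothing word-like on either side
matches '(?<!\w)(?:aa)(?!\w)' '%aa%'  # yes
```

The third case shows a second difference: \b requires a word character on one side, so a pattern made only of non-word characters cannot match a line consisting of just that pattern.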



Full text of original email included for reference:

2014-02-24 12:00:08 -0800, Jim Meyering:
> On Mon, Feb 24, 2014 at 2:01 AM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
> > Hello,
> >
> > Backreferences don't work with -w or -x in combination with -P:
> >
> > $ echo aa | grep -Pw '(.)\1'
> > $
> >
> > Or they work in an unexpected way:
> >
> > $ echo aa | grep -Pw '(.)\2'
> > aa
> >
> > The fix is simple:
> >
> >
> > --- src/pcresearch.c~   2014-02-24 09:59:56.864374362 +0000
> > +++ src/pcresearch.c    2014-02-24 07:33:04.666398105 +0000
> > @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si
> 
> Thanks a lot for the patch.
> I've converted it to a proper commit with NEWS and a test case.
> Please ack the attached if it's all ok with you (you're still the "Author:"):

> From bfd21931b3cd088d642a190e9f030214df04045d Mon Sep 17 00:00:00 2001
> From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
> Date: Mon, 24 Feb 2014 11:54:09 -0800
> Subject: [PATCH] grep -P: fix it so backreferences now work with -w and -x
> 
> To implement -w and -x, we bracket the search term with parentheses.
> However, that set of parentheses had the default semantics of
> "capturing", i.e., creating a backreferenceable matched quantity.
> Instead, use (?:...), to create a non-capturing group.
> * src/pcresearch.c (Pcompile): Use (?:...) rather than (...).
> * NEWS (Bug fixes): Mention it.
> * tests/pcre-wx-backref: New file.
> * tests/Makefile.am (TESTS): Add it.
> ---
>  NEWS                  |  6 ++++++
>  src/pcresearch.c      |  4 ++--
>  tests/Makefile.am     |  1 +
>  tests/pcre-wx-backref | 28 ++++++++++++++++++++++++++++
>  4 files changed, 37 insertions(+), 2 deletions(-)
>  create mode 100755 tests/pcre-wx-backref
> 
> diff --git a/NEWS b/NEWS
> index 771fd80..49fe984 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -2,6 +2,12 @@ GNU grep NEWS                                    -*- outline -*-
> 
>  * Noteworthy changes in release ?.? (????-??-??) [?]
> 
> +** Bug fixes
> +
> +  grep -P now works with -w and -x and backreferences. Before,
> +  echo aa|grep -Pw '(.)\1' would fail to match, yet
> +  echo aa|grep -Pw '(.)\2' would match.
> +
> 
>  * Noteworthy changes in release 2.18 (2014-02-20) [stable]
> 
> diff --git a/src/pcresearch.c b/src/pcresearch.c
> index 5b5ba3e..d4a20ff 100644
> --- a/src/pcresearch.c
> +++ b/src/pcresearch.c
> @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t size)
> 
>    *n = '\0';
>    if (match_lines)
> -    strcpy (n, "^(");
> +    strcpy (n, "^(?:");
>    if (match_words)
> -    strcpy (n, "\\b(");
> +    strcpy (n, "\\b(?:");
>    n += strlen (n);
> 
>    /* The PCRE interface doesn't allow NUL bytes in the pattern, so
> diff --git a/tests/Makefile.am b/tests/Makefile.am
> index 4ffea85..ecbe0e6 100644
> --- a/tests/Makefile.am
> +++ b/tests/Makefile.am
> @@ -83,6 +83,7 @@ TESTS =						\
>    pcre-abort					\
>    pcre-invalid-utf8-input			\
>    pcre-utf8					\
> +  pcre-wx-backref				\
>    pcre-z					\
>    prefix-of-multibyte				\
>    r-dot						\
> diff --git a/tests/pcre-wx-backref b/tests/pcre-wx-backref
> new file mode 100755
> index 0000000..643aa9b
> --- /dev/null
> +++ b/tests/pcre-wx-backref
> @@ -0,0 +1,28 @@
> +#! /bin/sh
> +# Before grep-2.19, grep -P and -w/-x would not work with a backreference.
> +#
> +# Copyright (C) 2014 Free Software Foundation, Inc.
> +#
> +# Copying and distribution of this file, with or without modification,
> +# are permitted in any medium without royalty provided the copyright
> +# notice and this notice are preserved.
> +
> +. "${srcdir=.}/init.sh"; path_prepend_ ../src
> +require_pcre_
> +
> +echo aa > in || framework_failure_
> +echo 'grep: reference to non-existent subpattern' > exp-err \
> +  || framework_failure_
> +
> +fail=0
> +
> +for xw in x w; do
> +  grep -P$xw '(.)\1' in > out 2>&1 || fail=1
> +  compare out in || fail=1
> +
> +  grep -P$xw '(.)\2' in > out 2> err && fail=1
> +  compare /dev/null out || fail=1
> +  compare exp-err err || fail=1
> +done
> +
> +Exit $fail
> -- 
> 1.9.0
> 





Information forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Tue, 25 Feb 2014 04:57:02 GMT)

Message #14 received at 16865 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 16865 <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Mon, 24 Feb 2014 20:55:42 -0800
On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> A last note: with -w, pcregrep wraps the regexp in \b...\b
> instead of \b(?:...)\b, so it could be that those brackets are
> not necessary in the first place.
>
> Sorry I lied, it was not the last note ;-). Note the difference:
>
> $ echo a@@b | grep -w @@
> $ echo a@@b | grep -Pw @@
> a@@b
>
>
> Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
>
> $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
> $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
> %aa%

I like both suggestions. Making -wP work like grep's -w makes perfect sense.
Care to prepare a patch to make it do that, with a separate test case?
"git format-patch ..." output preferred, if you're game.

I pushed the above patch, but would welcome another one.




Information forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Tue, 25 Feb 2014 16:09:01 GMT)

Message #17 received at 16865 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16865 <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Tue, 25 Feb 2014 16:08:22 +0000
2014-02-24 20:55:42 -0800, Jim Meyering:
> On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
> > A last note: with -w, pcregrep wraps the regexp in \b...\b
> > instead of \b(?:...)\b, so it could be that those brackets are
> > not necessary in the first place.

The brackets are actually needed in cases like:

grep -Pw 'foo|bar'

(pcregrep has a bug there).
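
Alternation binds more loosely than \b, so without the group each \b attaches to only one branch. A sketch, assuming a grep with -P support (`matches` is a hypothetical helper):

```shell
# Hypothetical helper: does PCRE pattern $1 match the line $2?
matches() {
  if printf '%s\n' "$2" | grep -qP -- "$1"; then echo yes; else echo no; fi
}

matches '\bfoo|bar\b'     'foobar'  # yes: parsed as "\bfoo" or "bar\b"
matches '\b(?:foo|bar)\b' 'foobar'  # no:  neither branch is a whole word
matches '\b(?:foo|bar)\b' 'a bar'   # yes: "bar" stands alone as a word
```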


> > Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
> >
> > $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
> > $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
> > %aa%
> 
> I like both suggestions. Making -wP work like grep's -w makes perfect sense.
> Care to prepare a patch to make it do that, with a separate test case?
> "git format-patch ..." output preferred, if you're game.
> 
> I pushed the above patch, but would welcome another one.

Please find the patch attached.

(note that tests/word-delim-multibyte fails for me, but it's not
my doing, it was failing before).

-- 
Stephane
[0001-Align-grep-Pw-with-grep-w.patch (text/x-diff, attachment)]

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Tue, 25 Feb 2014 18:04:03 GMT)

Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
bug acknowledged by developer. (Tue, 25 Feb 2014 18:04:04 GMT)

Message #22 received at 16865-done <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: 16865-done <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Tue, 25 Feb 2014 10:03:28 -0800
On Tue, Feb 25, 2014 at 8:08 AM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> 2014-02-24 20:55:42 -0800, Jim Meyering:
>> On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
>> <stephane.chazelas <at> gmail.com> wrote:
>> > A last note: with -w, pcregrep wraps the regexp in \b...\b
>> > instead of \b(?:...)\b, so it could be that those brackets are
>> > not necessary in the first place.
>
> The brackets are actually needed in cases like:
>
> grep -Pw 'foo|bar'
>
> (pcregrep has a bug there).
>
>
>> > Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
>> >
>> > $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
>> > $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
>> > %aa%
>>
>> I like both suggestions. Making -wP work like grep's -w makes perfect sense.
>> Care to prepare a patch to make it do that, with a separate test case?
>> "git format-patch ..." output preferred, if you're game.
>>
>> I pushed the above patch, but would welcome another one.
>
> Please find the patch attached.

Thank you very much.  Nearly perfect.
I've uncapitalized the 1-line summary, changed a "That" to "This"
in the log, added examples to NEWS, and added an empty
line to restore the 2-empty-line section delimiter.

> (note that tests/word-delim-multibyte fails for me, but it's not
> my doing, it was failing before).

That's an XFAIL test (as noted in tests/Makefile.am), hence expected
to fail; as long as it fails as expected, "make check" can still succeed.

I've closed this ticket, and will push once you ack these changes.
[0001-align-grep-Pw-with-grep-w.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16865; Package grep. (Tue, 25 Feb 2014 19:14:02 GMT)

Message #25 received at 16865-done <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16865-done <at> debbugs.gnu.org
Subject: Re: bug#16865: grep -wP and backreferences
Date: Tue, 25 Feb 2014 19:13:06 +0000
2014-02-25 10:03:28 -0800, Jim Meyering:
[...]
> I've uncapitalized the 1-line summary, changed a "That" to "This"
> in the log, added examples to NEWS, and added an empty
> line to restore the 2-empty-line section delimiter.
[...]
> I've closed this ticket, and will push once you ack these changes.
[...]

Thanks. Changes fine by me.

-- 
Stephane




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 26 Mar 2014 11:24:04 GMT)
