GNU bug report logs - #16865: grep -wP and backreferences
Report forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Mon, 24 Feb 2014 16:31:02 GMT)
Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 24 Feb 2014 16:31:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
Backreferences don't work with -w or -x in combination with -P:
$ echo aa | grep -Pw '(.)\1'
$
Or they work in an unexpected way:
$ echo aa | grep -Pw '(.)\2'
aa
The fix is simple:
--- src/pcresearch.c~ 2014-02-24 09:59:56.864374362 +0000
+++ src/pcresearch.c 2014-02-24 07:33:04.666398105 +0000
@@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si
*n = '\0';
if (match_lines)
- strcpy (n, "^(");
+ strcpy (n, "^(?:");
if (match_words)
- strcpy (n, "\\b(");
+ strcpy (n, "\\b(?:");
n += strlen (n);
/* The PCRE interface doesn't allow NUL bytes in the pattern, so
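
The numbering shift behind this report can be reproduced outside grep. The sketch below is only an illustration: it uses Python's `re` module as a stand-in for PCRE (grep itself calls PCRE from C), and the helper names `wrap_w` and `try_match` are hypothetical. It wraps the user's pattern once with a capturing group, as the old code did, and once with `(?:`, as the patch does.

```python
import re

def wrap_w(pattern, capturing):
    """Wrap a user pattern for -w, using the old "(" or the patched "(?:"."""
    prefix = r"\b(" if capturing else r"\b(?:"
    return prefix + pattern + r")\b"

def try_match(regex, text):
    """Return True/False for match/no match, or "error" if compilation fails."""
    try:
        return re.search(regex, text) is not None
    except re.error:
        return "error"

# Old wrapper: the added "(" becomes group 1, so the user's "(.)" is group 2
# and "\2" accidentally refers to it.
old_backref_2 = try_match(wrap_w(r"(.)\2", capturing=True), "aa")

# Old wrapper with "\1": the reference points at the wrapper group itself.
# Python refuses to compile a reference to a still-open group, whereas the
# report shows PCRE accepting the pattern and simply not matching.
old_backref_1 = try_match(wrap_w(r"(.)\1", capturing=True), "aa")

# Patched wrapper: "(?:" does not capture, so the numbering is the user's own.
new_backref_1 = try_match(wrap_w(r"(.)\1", capturing=False), "aa")
new_backref_2 = try_match(wrap_w(r"(.)\2", capturing=False), "aa")

print(old_backref_2, old_backref_1, new_backref_1, new_backref_2)
```

With the non-capturing wrapper, `\1` matches and `\2` is rejected as a reference to a group that does not exist, which is the behaviour the report asks for.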
Information forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Mon, 24 Feb 2014 20:01:02 GMT)
Message #8 received at 16865 <at> debbugs.gnu.org (full text, mbox):
On Mon, Feb 24, 2014 at 2:01 AM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> Hello,
>
> Backreferences don't work with -w or -x in combination with -P:
>
> $ echo aa | grep -Pw '(.)\1'
> $
>
> Or they work in an unexpected way:
>
> $ echo aa | grep -Pw '(.)\2'
> aa
>
> The fix is simple:
>
>
> --- src/pcresearch.c~ 2014-02-24 09:59:56.864374362 +0000
> +++ src/pcresearch.c 2014-02-24 07:33:04.666398105 +0000
> @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si
Thanks a lot for the patch.
I've converted it to a proper commit with NEWS and a test case.
Please ack the attached if it's all ok with you (you're still the "Author:"):
[k.txt (text/plain, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Mon, 24 Feb 2014 21:21:03 GMT)
Message #11 received at 16865 <at> debbugs.gnu.org (full text, mbox):
Fine by me, thanks.
BTW, as discussed in another bug, -w and -x invalidate (*UCP) and the
other PCRE special sequences that must come first in the pattern. Chances
are we can't easily do much about it, but it may still be worth documenting.
For example, one should use
grep -P '(*UCP)\bword\b'
as
grep -wP '(*UCP)word'
won't work (pcregrep has the same problem).
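
A minimal sketch of why the second form fails, assuming the `\b(?:...)\b` wrapping described in this thread (the helper name `wrap_words` is hypothetical). It only builds the string that would be handed to PCRE and shows that `(*UCP)` is no longer at the start of it.

```python
def wrap_words(pattern):
    # Same shape as the -w wrapper discussed in the thread: \b(?: ... )\b
    return r"\b(?:" + pattern + r")\b"

user_pattern = "(*UCP)word"
wrapped = wrap_words(user_pattern)

# PCRE only treats (*UCP) as a directive at the very start of the pattern.
print(user_pattern.startswith("(*UCP)"))  # directive leads the user's pattern
print(wrapped.startswith("(*UCP)"))       # but not the pattern after wrapping
print(wrapped)
```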
In another bug, I've seen someone commenting that
grep -wP 'a)(b'
doesn't give the error message that one would expect (not that
I'd expect anyone would care).
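
The missing error is easy to see once the wrapper is written out: the user's stray `)` closes the wrapper's group and the user's `(` is closed by the wrapper's `)`. The sketch below uses Python's `re` as a stand-in for PCRE (an assumption for illustration; `compiles` is a hypothetical helper).

```python
import re

def compiles(regex):
    """Return True if the regex compiles, False if it is rejected."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False

user_pattern = "a)(b"
wrapped = r"\b(?:" + user_pattern + r")\b"   # becomes \b(?:a)(b)\b

bare_ok = compiles(user_pattern)   # unbalanced on its own
wrapped_ok = compiles(wrapped)     # balanced once the wrapper is added
print(bare_ok, wrapped_ok)
```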
A last note: with -w, pcregrep wraps the regexp in \b...\b
instead of \b(?:...)\b, so it could be that those brackets are
not necessary in the first place.
Sorry I lied, it was not the last note ;-). Note the difference:
$ echo a@@b | grep -w @@
$ echo a@@b | grep -Pw @@
a@@b
Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
$ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
$ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
%aa%
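
The two wrappers can be compared side by side. This is a sketch using Python's `re` in place of PCRE; the helper names are hypothetical, and keeping the `(?:...)` group inside the lookaround variant is an assumption, not something the message above specifies.

```python
import re

def boundary_w(pattern):
    # Wrapper based on \b word-boundary assertions
    return r"\b(?:" + pattern + r")\b"

def lookaround_w(pattern):
    # Proposed wrapper: no word character immediately before or after
    return r"(?<!\w)(?:" + pattern + r")(?!\w)"

def matches(regex, line):
    return re.search(regex, line) is not None

# "\b" is satisfied between "a" and "@", so the boundary wrapper matches.
at_boundary = matches(boundary_w("@@"), "a@@b")
# The lookbehind sees the word character "a" and rejects the match.
at_lookaround = matches(lookaround_w("@@"), "a@@b")
percent_case = matches(lookaround_w("%%"), "a%%b")
word_case = matches(lookaround_w("aa"), "%aa%")

print(at_boundary, at_lookaround, percent_case, word_case)
```

The lookaround form rejects `a@@b` and `a%%b` but accepts `%aa%`, which lines up with the plain `grep -w` result shown above.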
Full text of original email included for reference:
2014-02-24 12:00:08 -0800, Jim Meyering:
> On Mon, Feb 24, 2014 at 2:01 AM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
> > Hello,
> >
> > Backreferences don't work with -w or -x in combination with -P:
> >
> > $ echo aa | grep -Pw '(.)\1'
> > $
> >
> > Or they work in an unexpected way:
> >
> > $ echo aa | grep -Pw '(.)\2'
> > aa
> >
> > The fix is simple:
> >
> >
> > --- src/pcresearch.c~ 2014-02-24 09:59:56.864374362 +0000
> > +++ src/pcresearch.c 2014-02-24 07:33:04.666398105 +0000
> > @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t si
>
> Thanks a lot for the patch.
> I've converted it to a proper commit with NEWS and a test case.
> Please ack the attached if it's all ok with you (you're still the "Author:"):
> From bfd21931b3cd088d642a190e9f030214df04045d Mon Sep 17 00:00:00 2001
> From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
> Date: Mon, 24 Feb 2014 11:54:09 -0800
> Subject: [PATCH] grep -P: fix it so backreferences now work with -w and -x
>
> To implement -w and -x, we bracket the search term with parentheses.
> However, that set of parentheses had the default semantics of
> "capturing", i.e., creating a backreferenceable matched quantity.
> Instead, use (?:...), to create a non-capturing group.
> * src/pcresearch.c (Pcompile): Use (?:...) rather than (...).
> * NEWS (Bug fixes): Mention it.
> * tests/pcre-wx-backref: New file.
> * tests/Makefile.am (TESTS): Add it.
> ---
> NEWS | 6 ++++++
> src/pcresearch.c | 4 ++--
> tests/Makefile.am | 1 +
> tests/pcre-wx-backref | 28 ++++++++++++++++++++++++++++
> 4 files changed, 37 insertions(+), 2 deletions(-)
> create mode 100755 tests/pcre-wx-backref
>
> diff --git a/NEWS b/NEWS
> index 771fd80..49fe984 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -2,6 +2,12 @@ GNU grep NEWS -*- outline -*-
>
> * Noteworthy changes in release ?.? (????-??-??) [?]
>
> +** Bug fixes
> +
> + grep -P now works with -w and -x and backreferences. Before,
> + echo aa|grep -Pw '(.)\1' would fail to match, yet
> + echo aa|grep -Pw '(.)\2' would match.
> +
>
> * Noteworthy changes in release 2.18 (2014-02-20) [stable]
>
> diff --git a/src/pcresearch.c b/src/pcresearch.c
> index 5b5ba3e..d4a20ff 100644
> --- a/src/pcresearch.c
> +++ b/src/pcresearch.c
> @@ -75,9 +75,9 @@ Pcompile (char const *pattern, size_t size)
>
> *n = '\0';
> if (match_lines)
> - strcpy (n, "^(");
> + strcpy (n, "^(?:");
> if (match_words)
> - strcpy (n, "\\b(");
> + strcpy (n, "\\b(?:");
> n += strlen (n);
>
> /* The PCRE interface doesn't allow NUL bytes in the pattern, so
> diff --git a/tests/Makefile.am b/tests/Makefile.am
> index 4ffea85..ecbe0e6 100644
> --- a/tests/Makefile.am
> +++ b/tests/Makefile.am
> @@ -83,6 +83,7 @@ TESTS = \
> pcre-abort \
> pcre-invalid-utf8-input \
> pcre-utf8 \
> + pcre-wx-backref \
> pcre-z \
> prefix-of-multibyte \
> r-dot \
> diff --git a/tests/pcre-wx-backref b/tests/pcre-wx-backref
> new file mode 100755
> index 0000000..643aa9b
> --- /dev/null
> +++ b/tests/pcre-wx-backref
> @@ -0,0 +1,28 @@
> +#! /bin/sh
> +# Before grep-2.19, grep -P and -w/-x would not work with a backreference.
> +#
> +# Copyright (C) 2014 Free Software Foundation, Inc.
> +#
> +# Copying and distribution of this file, with or without modification,
> +# are permitted in any medium without royalty provided the copyright
> +# notice and this notice are preserved.
> +
> +. "${srcdir=.}/init.sh"; path_prepend_ ../src
> +require_pcre_
> +
> +echo aa > in || framework_failure_
> +echo 'grep: reference to non-existent subpattern' > exp-err \
> + || framework_failure_
> +
> +fail=0
> +
> +for xw in x w; do
> + grep -P$xw '(.)\1' in > out 2>&1 || fail=1
> + compare out in || fail=1
> +
> + grep -P$xw '(.)\2' in > out 2> err && fail=1
> + compare /dev/null out || fail=1
> + compare exp-err err || fail=1
> +done
> +
> +Exit $fail
> --
> 1.9.0
>
Information forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Tue, 25 Feb 2014 04:57:02 GMT)
Message #14 received at 16865 <at> debbugs.gnu.org (full text, mbox):
On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> A last note: with -w, pcregrep wraps the regexp in \b...\b
> instead of \b(?:...)\b, so it could be that those brackets are
> not necessary in the first place.
>
> Sorry I lied, it was not the last note ;-). Note the difference:
>
> $ echo a@@b | grep -w @@
> $ echo a@@b | grep -Pw @@
> a@@b
>
>
> Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
>
> $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
> $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
> %aa%
I like both suggestions. Making -wP work like grep's -w makes perfect sense.
Care to prepare a patch to make it do that, with a separate test case?
"git format-patch ..." output preferred, if you're game.
I pushed the above patch, but would welcome another one.
Information forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Tue, 25 Feb 2014 16:09:01 GMT)
Message #17 received at 16865 <at> debbugs.gnu.org (full text, mbox):
2014-02-24 20:55:42 -0800, Jim Meyering:
> On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
> > A last note: with -w, pcregrep wraps the regexp in \b...\b
> > instead of \b(?:...)\b, so it could be that those brackets are
> > not necessary in the first place.
The brackets are actually needed in cases like:
grep -Pw 'foo|bar'
(pcregrep has a bug there).
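
A short sketch of why the group matters, again using Python's `re` as a stand-in for PCRE. Without the group, alternation binds more loosely than the boundary assertions, so each `\b` applies to only one branch. Whether pcregrep's bug shows exactly this symptom is not stated in the thread; the input line is made up for illustration.

```python
import re

user_pattern = "foo|bar"
ungrouped = r"\b" + user_pattern + r"\b"     # parsed as (\bfoo) | (bar\b)
grouped = r"\b(?:" + user_pattern + r")\b"   # both boundaries apply to each branch

line = "foobaz"   # contains neither "foo" nor "bar" as a whole word

ungrouped_hit = re.search(ungrouped, line) is not None
grouped_hit = re.search(grouped, line) is not None
print(ungrouped_hit, grouped_hit)
```

The ungrouped pattern matches `foobaz` through its `\bfoo` branch, while the grouped one correctly finds no whole-word match.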
> > Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
> >
> > $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
> > $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
> > %aa%
>
> I like both suggestions. Making -wP work like grep's -w makes perfect sense.
> Care to prepare a patch to make it do that, with a separate test case?
> "git format-patch ..." output preferred, if you're game.
>
> I pushed the above patch, but would welcome another one.
Please find the patch attached.
(note that tests/word-delim-multibyte fails for me, but it's not
my doing, it was failing before).
--
Stephane
[0001-Align-grep-Pw-with-grep-w.patch (text/x-diff, attachment)]
Reply sent to Jim Meyering <jim <at> meyering.net>: You have taken responsibility. (Tue, 25 Feb 2014 18:04:03 GMT)
Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: bug acknowledged by developer. (Tue, 25 Feb 2014 18:04:04 GMT)
Message #22 received at 16865-done <at> debbugs.gnu.org (full text, mbox):
On Tue, Feb 25, 2014 at 8:08 AM, Stephane Chazelas
<stephane.chazelas <at> gmail.com> wrote:
> 2014-02-24 20:55:42 -0800, Jim Meyering:
>> On Mon, Feb 24, 2014 at 1:20 PM, Stephane Chazelas
>> <stephane.chazelas <at> gmail.com> wrote:
>> > A last note: with -w, pcregrep wraps the regexp in \b...\b
>> > instead of \b(?:...)\b, so it could be that those brackets are
>> > not necessary in the first place.
>
> The brackets are actually needed in cases like:
>
> grep -Pw 'foo|bar'
>
> (pcregrep has a bug there).
>
>
>> > Maybe instead of \b(?:...)\b, we could use (?<!\w)...(?!\w)
>> >
>> > $ echo a%%b | grep -P '(?<!\w)%%(?!\w)'
>> > $ echo %aa% | grep -P '(?<!\w)aa(?!\w)'
>> > %aa%
>>
>> I like both suggestions. Making -wP work like grep's -w makes perfect sense.
>> Care to prepare a patch to make it do that, with a separate test case?
>> "git format-patch ..." output preferred, if you're game.
>>
>> I pushed the above patch, but would welcome another one.
>
> Please find the patch attached.
Thank you very much. Nearly perfect.
I've uncapitalized the 1-line summary, changed a "That" to "This"
in the log, added examples to NEWS, and added an empty
line to restore the 2-empty-line section delimiter.
> (note that tests/word-delim-multibyte fails for me, but it's not
> my doing, it was failing before).
That's an XFAIL test (as noted in tests/Makefile.am), hence expected
to fail; as long as it fails as expected, "make check" can still succeed.
I've closed this ticket, and will push once you ack these changes.
[0001-align-grep-Pw-with-grep-w.patch (application/octet-stream, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16865; Package grep. (Tue, 25 Feb 2014 19:14:02 GMT)
Message #25 received at 16865-done <at> debbugs.gnu.org (full text, mbox):
2014-02-25 10:03:28 -0800, Jim Meyering:
[...]
> I've uncapitalized the 1-line summary, changed a That to This
> in the log, and added examples to NEWS, and added an empty
> line to restore the 2-empty-line section delimiter.
[...]
> I've closed this ticket, and will push once you ack these changes.
[...]
Thanks. Changes fine by me.
--
Stephane
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 26 Mar 2014 11:24:04 GMT)