GNU bug report logs - #68725
GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE capture groups and backreferences


Package: sed

Reported by: Ed Morton <mortoneccc <at> comcast.net>

Date: Fri, 26 Jan 2024 04:17:01 UTC

Severity: normal

To reply to this bug, email your comments to 68725 AT debbugs.gnu.org.



Report forwarded to bug-sed <at> gnu.org:
bug#68725; Package sed. (Fri, 26 Jan 2024 04:17:01 GMT)

Acknowledgement sent to Ed Morton <mortoneccc <at> comcast.net>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Fri, 26 Jan 2024 04:17:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ed Morton <mortoneccc <at> comcast.net>
To: bug-sed <at> gnu.org, bug-grep <at> gnu.org
Subject: GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE
 capture groups and backreferences
Date: Thu, 25 Jan 2024 10:46:34 -0600
There are issues (mostly common but some not) using a regexp like this:

   |^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$|

with GNU grep and GNU sed, hence my contacting both mailing lists but 
apologies if that was the wrong starting point.

This started out as a question on StackOverflow, 
(https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446) 
but my "answer" and some comments from there copied below so you don't 
have to look anywhere else for a description of the issues.

Given this input file:

|a|
|ab|
|abba|
|abcdef|
|abcba|
|zufolo|

Removing the `$` from the end of the regexp (i.e. making it less
restrictive) produces fewer matches, which is the opposite of what it
should do:

a) With the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
    a
    abba
    abcba

It's not just GNU grep that behaves strangely: GNU sed has the same
behavior from the question when just matching with
`sed -nE '/.../p' sample` as GNU `grep` does, AND sed behaves
differently if we're just doing a match vs if we're doing a
match + replace. For example, here's `sed` doing a match+replacement and
behaving the same way as `grep` above:

a) With the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

but here's sed just doing a match and behaving differently from any of
the above:

a) With the `$` at the end of the regexp (note the extra `ab` in the
   output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp (note the extra `ab` and
   `abcdef` in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

Also interestingly this:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample

outputs:

    <a>
    <abba>
    <abcba>
    <>zufolo

the last line of which means the regexp is apparently matching the
start of the line and ignoring the `$` end-of-string metachar present in
the regexp!

The odd behavior isn't just associated with using `-E`, though. If I
remove `-E` and just use [POSIX compliant
BREs](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03)
then:

a) With the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
    a
    abba
    abcba

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

and again just doing a match in sed below behaves differently from the
sed match+replacements above:

a) With the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

The above shows that, given the same regexp, sed is apparently matching
different strings depending on whether it's doing a substitution or not.

These are the versions I was using when testing above:

    $ grep --version | head -1
    grep (GNU grep) 3.11

    $ sed --version | head -1
    sed (GNU sed) 4.9

It was later pointed out that grep in git-bash produces an error message
and core dumps given the original regexp above, e.g.

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample

and

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample

both output:

    a
    assertion "num >= 0" failed: file "regexec.c", line 1394, function: pop_fail_stack
    Aborted (core dumped)

Sorry, I can't copy the core off that machine for corporate reasons.
Those git-bash tests were using:

    $ echo $BASH_VERSION
    5.2.15(1)-release

    $ grep --version
    grep (GNU grep) 3.0

Regards,
Ed Morton
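
For reference, the results the report expects from each variant can be worked out without any regex engine. The following is only a sketch added for illustration, assuming a POSIX shell; `reverse_string` is a hypothetical helper, not something from the report. With the trailing `$`, the pattern can only describe palindromes of at most 11 characters (up to five captured characters, an optional middle character, then the captures in reverse). Without it, every group may be empty, so the empty match at the start of a line should satisfy the pattern on every line.

```shell
#!/bin/sh
# Illustrative cross-check only; names here are assumptions, not from the report.
sample=$(mktemp)
printf '%s\n' a ab abba abcdef abcba zufolo > "$sample"

# reverse_string (hypothetical helper): print $1 reversed, one character at a time.
reverse_string() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        first=${rest%"${rest#?}"}   # first character of $rest
        out=$first$out              # prepend it to the result
        rest=${rest#?}              # drop it from the remainder
    done
    printf '%s' "$out"
}

# With the trailing `$`: only palindromes of at most 11 characters qualify.
with_anchor=$(
    while IFS= read -r line; do
        if [ "${#line}" -le 11 ] && [ "$line" = "$(reverse_string "$line")" ]; then
            printf '%s\n' "$line"
        fi
    done < "$sample"
)

# Without the trailing `$`: all groups may be empty, so the empty match
# at the start of the line is enough and every line qualifies.
without_anchor=$(cat "$sample")

printf 'with $:\n%s\n\nwithout $:\n%s\n' "$with_anchor" "$without_anchor"
rm -f "$sample"
```

Under those assumptions the first list is `a`, `abba`, `abcba` and the second is all six lines, which is the baseline the grep and sed outputs above can be compared against.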

Information forwarded to bug-sed <at> gnu.org:
bug#68725; Package sed. (Tue, 06 Feb 2024 07:04:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Ed Morton <mortoneccc <at> comcast.net>
Cc: 68725 <at> debbugs.gnu.org, bug-grep <at> gnu.org
Subject: Re: bug#68725: GNU grep and sed behaving unexpectedly with multiple
 1-or-0 RE capture groups and backreferences
Date: Mon, 5 Feb 2024 23:02:42 -0800
On Fri, Jan 26, 2024 at 6:51 AM Ed Morton <mortoneccc <at> comcast.net> wrote:
>
> There are issues (mostly common but some not) using a regexp like this:
>
>     |^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$|
>
> with GNU grep and GNU sed, hence my contacting both mailing lists but
> apologies if that was the wrong starting point.
>
> This started out as a question on StackOverflow,
> (https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446)
> but my "answer" and some comments from there copied below so you don't
> have to look anywhere else for a description of the issues.
>
> Given this input file:
>
> |a|
> |ab|
> |abba|
> |abcdef|
> |abcba|
> |zufolo|
>
> Removing the `$` from the end of the regexp (i.e. making it less
> restrictive) produces fewer matches, which is the opposite of what it
> should do:
>
> a) With the `$` at the end of the regexp:
>
>     $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
>     a
>     abba
>     abcba
>     zufolo
>
> b) Without the `$` at the end of the regexp:
>
>     $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
>     a
>     abba
>     abcba

Thanks for reporting that. This is as far as I've gotten for now, but
this sure looks like a bug:

  $ echo zufolo | grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$'
  zufolo

Obviously, that string should not match.

Note that it works properly with the -P option in place of that -E.
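
The `-P` comparison can be scripted roughly as follows. This is a sketch that assumes a GNU grep built with PCRE support; where `-P` is unavailable, grep exits with a status above 1 and the script reports that instead of a result. The variable names are illustrative.

```shell
#!/bin/sh
# Sketch only: assumes GNU grep with -P (PCRE) support is installed.
pattern='^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$'

# grep -q exits 0 on a match, 1 on no match, and >1 on error.
if printf '%s\n' zufolo | grep -qP "$pattern"; then
    status=0
else
    status=$?
fi

case $status in
    0) result='match' ;;
    1) result='no match' ;;
    *) result='grep -P unavailable or failed' ;;
esac

printf 'zufolo with -P: %s\n' "$result"
```

Since `zufolo` is not a palindrome, a correct matcher should report no match here, in contrast to the `-E` run above.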






This bug report was last modified 88 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.