GNU bug report logs - #68726
GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE capture groups and backreferences


Package: sed;

Reported by: Ed Morton <mortoneccc <at> comcast.net>

Date: Fri, 26 Jan 2024 04:17:02 UTC

Severity: normal

To reply to this bug, email your comments to 68726 AT debbugs.gnu.org.


Report forwarded to bug-sed <at> gnu.org:
bug#68726; Package sed. (Fri, 26 Jan 2024 04:17:02 GMT)

Acknowledgement sent to Ed Morton <mortoneccc <at> comcast.net>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Fri, 26 Jan 2024 04:17:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ed Morton <mortoneccc <at> comcast.net>
To: bug-sed <at> gnu.org, bug-grep <at> gnu.org
Subject: GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE
 capture groups and backreferences
Date: Thu, 25 Jan 2024 10:46:34 -0600
There are issues (mostly common but some not) using a regexp like this:

   |^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$|

with GNU grep and GNU sed, hence my contacting both mailing lists but 
apologies if that was the wrong starting point.
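
For reference, with the trailing `$` the regexp reads on paper as "a palindrome of at most 11 characters": up to five optional captured characters, one optional middle character, then the captures in reverse order. The following is a minimal sketch of that reading that avoids backreferences entirely. The function name `is_short_palindrome` is hypothetical and the awk loop is only an illustration, not part of the original report.

```shell
# Hypothetical helper: print each input line that is a palindrome of at
# most 11 characters, without using any regexp backreferences.
is_short_palindrome() {
    awk 'length($0) <= 11 {
        n = length($0); ok = 1
        # Compare the i-th character from the front with the i-th from the back.
        for (i = 1; i <= int(n / 2); i++)
            if (substr($0, i, 1) != substr($0, n - i + 1, 1)) { ok = 0; break }
        if (ok) print
    }'
}

# The six test lines from this report; prints a, abba and abcba.
printf '%s\n' a ab abba abcdef abcba zufolo | is_short_palindrome
```

Without the trailing `$`, the same reading only requires a palindromic prefix, and the empty prefix qualifies, so every line would be expected to match.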

This started out as a question on StackOverflow
(https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446),
but my "answer" and some comments from there are copied below so you
don't have to look anywhere else for a description of the issues.

Given this input file:

|a|
|ab|
|abba|
|abcdef|
|abcba|
|zufolo|
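
To reproduce the commands in this report, the input can be written to a file named `sample`, the name those commands use. This is a minimal sketch, assuming a POSIX shell with `printf`:

```shell
# Write the six test lines, one per line, to ./sample
printf '%s\n' a ab abba abcdef abcba zufolo > sample

# Confirm the file has six lines
wc -l < sample
```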

Removing the `$` from the end of the regexp (i.e. making it less
restrictive) produces fewer matches, which is the opposite of what it
should do:

a) With the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
    a
    abba
    abcba

It's not just GNU grep that behaves strangely. GNU sed has the same
behavior from the question when just matching with
`sed -nE '/.../p' sample` as GNU `grep` does, AND sed behaves
differently if we're just doing a match vs. if we're doing a
match + replace.

For example, here's `sed` doing a match + replacement and behaving the
same way as `grep` above:

a) With the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

but here's sed just doing a match and behaving differently from any of
the above:

a) With the `$` at the end of the regexp (note the extra `ab` in the
output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp (note the extra `ab` and
`abcdef` in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

Also interestingly this:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample

outputs:

    <a>
    <abba>
    <abcba>
    <>zufolo

the last line of which means the regexp is apparently matching the
start of the line and ignoring the `$` end-of-string metachar present
in the regexp!

The odd behavior isn't just associated with using `-E`, though. If I
remove `-E` and just use [POSIX compliant
BREs](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03)
then:

a) With the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
    a
    abba
    abcba

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

and again just doing a match in sed below behaves differently from the
sed match + replacements above:

a) With the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

The above shows that, given the same regexp, sed is apparently matching
different strings depending on whether it's doing a substitution or
not.

These are the versions I was using when testing above:

    $ grep --version | head -1
    grep (GNU grep) 3.11

    $ sed --version | head -1
    sed (GNU sed) 4.9

It was later pointed out that grep in git-bash produces an error
message and core dumps given the original regexp above, e.g.

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample

and

    grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample

both output:

    a
    assertion "num >= 0" failed: file "regexec.c", line 1394, function: pop_fail_stack
    Aborted (core dumped)

Sorry, I can't copy the core off that machine for corporate reasons.
Those git-bash tests were using:

    $ echo $BASH_VERSION
    5.2.15(1)-release

    $ grep --version
    grep (GNU grep) 3.0

Regards,
Ed Morton

This bug report was last modified 99 days ago.
