GNU bug report logs - #68726
GNU grep and sed behaving unexpectedly with multiple 1-or-0 RE capture groups and backreferences
To reply to this bug, email your comments to 68726 AT debbugs.gnu.org.
Report forwarded to bug-sed <at> gnu.org: bug#68726; Package sed. (Fri, 26 Jan 2024 04:17:02 GMT)
Acknowledgement sent to Ed Morton <mortoneccc <at> comcast.net>: New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Fri, 26 Jan 2024 04:17:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
There are issues (mostly common to both tools, but some not) using a
regexp like this:

    ^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$

with GNU grep and GNU sed, hence my contacting both mailing lists;
apologies if that was the wrong starting point.

This started out as a question on StackOverflow
(https://stackoverflow.com/questions/77820540/searching-palindromes-with-grep-e-egrep/77861446?noredirect=1#comment137299746_77861446),
but my "answer" and some comments from there are copied below so you
don't have to look anywhere else for a description of the issues.

Given this input file:

    a
    ab
    abba
    abcdef
    abcba
    zufolo

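To reproduce, the input file listed above can be created with one command. This is a minimal sketch: the file name `sample` matches the commands used in the rest of this report, and writing it to the current directory is an assumption.

```shell
# Write the six test lines, one per line, to a file named "sample".
printf '%s\n' a ab abba abcdef abcba zufolo > sample
```
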
Removing the `$` from the end of the regexp (i.e. making it less
restrictive) produces fewer matches, which is the opposite of what it
should do:

a) With the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
    a
    abba
    abcba

It's not just GNU grep that behaves strangely. GNU sed has the same
behavior from the question as GNU `grep` does when just matching with
`sed -nE '/.../p' sample`, AND sed behaves differently if we're just
doing a match vs if we're doing a match + replace.

For example, here's `sed` doing a match+replacement and behaving the
same way as `grep` above:

a) With the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

but here's sed just doing a match and behaving differently from any of
the above:

a) With the `$` at the end of the regexp (note the extra `ab` in the
output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp (note the extra `ab` and
`abcdef` in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

Also interestingly this:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample

outputs:

    <a>
    <abba>
    <abcba>
    <>zufolo

the last line of which means the regexp is apparently matching the
empty string at the start of the line and ignoring the `$`
end-of-string metachar present in the regexp!

The odd behavior isn't just associated with using `-E`, though. If I
remove `-E` and just use [POSIX compliant
BREs](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03)
then:

a) With the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
    a
    abba
    abcba

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
    a
    abba
    abcba

and again, just doing a match in sed below behaves differently from the
sed match+replacements above:

a) With the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo

b) Without the `$` at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo

The above shows that, given the same regexp, sed is apparently matching
different strings depending on whether it's doing a substitution or not.

These are the versions I was using when testing above:

    $ grep --version | head -1
    grep (GNU grep) 3.11
    $ sed --version | head -1
    sed (GNU sed) 4.9

It was later pointed out that grep in git-bash produces an error
message and core dumps given the original regexp above, e.g.
`grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample` and
`grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample` both output:

    a
    assertion "num >= 0" failed: file "regexec.c", line 1394, function: pop_fail_stack
    Aborted (core dumped)

Sorry, I can't copy the core off that machine for corporate reasons.
Those git-bash tests were using:

    $ echo $BASH_VERSION
    5.2.15(1)-release
    $ grep --version
    grep (GNU grep) 3.0

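The ERE comparisons above can be rerun in one pass with a small script. This is only a sketch under stated assumptions: the helper `run` and the variables `re`, `full`, `anchor` and `file` are illustrative names that do not come from the original report, and the input file is recreated if it is missing. The results depend on the installed grep and sed versions, so no expected output is shown.

```shell
#!/bin/sh
# Input file name; defaults to "sample" as used throughout the report.
file=${1:-sample}

# Recreate the six-line input if it does not exist yet.
[ -f "$file" ] || printf '%s\n' a ab abba abcdef abcba zufolo > "$file"

# Hypothetical helper: print a label, then run the given command.
# A non-zero exit (no match, or an abort) does not stop the script.
run() {
    label=$1
    shift
    printf '== %s\n' "$label"
    "$@" || true
}

# The ERE from the report, without the trailing end-of-line anchor.
re='^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1'

# Try each form first with the trailing '$' and then without it.
for anchor in '$' ''; do
    full="$re$anchor"
    run "grep -E, anchor='$anchor'"      grep -E "$full" "$file"
    run "sed s/re/&/p, anchor='$anchor'" sed -nE "s/$full/&/p" "$file"
    run "sed /re/p, anchor='$anchor'"    sed -nE "/$full/p" "$file"
done

# Record the tool versions alongside the results.
grep --version | head -1
sed --version | head -1
```

Saving this as, say, `repro.sh` and running `sh repro.sh` prints one labelled section per command, which makes it easier to compare the match-only and match+replace forms side by side across versions.
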
Regards,
Ed Morton
This bug report was last modified 284 days ago.