GNU bug report logs - #20657
Traditional range expression not accepted in regex/dfa
Reported by: arnold <at> skeeve.com
Date: Tue, 26 May 2015 02:43:02 UTC
Severity: wishlist
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20657 in the body.
You can then email your comments to 20657 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#20657; Package grep. (Tue, 26 May 2015 02:43:02 GMT)
Full text and rfc822 format available.
Acknowledgement sent to arnold <at> skeeve.com: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 26 May 2015 02:43:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi.
I received a bug report for gawk by private email that a regexp of
this form: '[^0-9---]' wasn't accepted. The bugaboo here is the "---"; it's
a range expression consisting of minus through minus, and apparently long
ago was how one got a minus into a bracket expression.
This can be seen in current grep also:
$ ./src/grep --version
./src/grep (GNU grep) 2.21
Copyright (C) 2014 Free Software Foundation, Inc.
...
$ ./src/grep '[^0-9---]' /dev/null
./src/grep: Invalid range end
The underlying regex and, I believe, dfa routines don't accept this.
Fixing either of them is beyond my skill range, so I thought I'd
pass this one upstream to you folks.
Thanks!
Arnold
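The failure reported above can be checked against whichever grep is installed. The following is a minimal sketch, not part of the original report: probe_bracket is a hypothetical helper name, and the script assumes only that grep exits with status 2 when it rejects a pattern. What it prints for '[^0-9---]' depends on the grep and regex library in use, so no result is claimed for that pattern here.

```shell
#!/bin/sh
# Probe whether the grep on PATH accepts a given bracket expression.
# probe_bracket is a hypothetical helper for illustration, not part of grep.
LC_ALL=C
export LC_ALL

probe_bracket() {
    # grep exits with 0 (match) or 1 (no match) for a pattern it accepts,
    # and with 2 when it reports an error such as "Invalid range end".
    status=0
    grep -e "$1" /dev/null >/dev/null 2>&1 || status=$?
    if [ "$status" -lt 2 ]; then
        echo accepted
    else
        echo rejected
    fi
}

# The hyphen-last form is always valid; the "---" form is the one in question.
for pattern in '[^0-9-]' '[^0-9---]'; do
    printf '%s: %s\n' "$pattern" "$(probe_bracket "$pattern")"
done
```

Reading /dev/null keeps the probe independent of any input data, so the exit status reflects only whether the pattern compiled.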
Information forwarded to bug-grep <at> gnu.org: bug#20657; Package grep. (Tue, 26 May 2015 06:54:02 GMT)
Message #8 received at 20657 <at> debbugs.gnu.org (full text, mbox):
arnold <at> skeeve.com wrote:
> The bugaboo here is the "---"; it's
> a range expression consisting of minus through minus, and apparently long
> ago was how one got a minus into a bracket expression.
Actually, long ago expressions like '[^0-9-]' worked just as they do now, and it
wasn't ever necessary to use trailing "---". That being said, it is true that
in 7th Edition Unix '[^0-9---]' meant the same thing as '[^0-9-]', so in that
sense we have an incompatibility with 7th Edition Unix here.
> $ ./src/grep '[^0-9---]' /dev/null
> ./src/grep: Invalid range end
>
> The underlying regex and, I believe, dfa routines don't accept this.
Yes, that's correct. It's not a bug, though, as the regexp is ambiguous and
does not conform to POSIX, which says the following about RE bracket
expressions: "To use a <hyphen> as the starting range point, it shall either
come first in the bracket expression or be specified as a collating symbol; for
example, "[][.-.]-0]", which matches either a <right-square-bracket> or any
character or collating element that collates between <hyphen> and 0, inclusive."
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05>
In your correspondent's example, the hyphen is a starting range point but is
neither first in the bracket expression nor specified as a collating symbol,
so the regexp doesn't conform to POSIX.
Even though it's not a bug I suppose it wouldn't hurt to make the GNU matchers
compatible with 7th Edition Unix here, if someone really wants to take that task
on; it's not urgent, though.
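The two ways of writing a literal hyphen discussed above can be compared side by side. This is a sketch under stated assumptions: sample is a hypothetical helper that emits made-up test lines, and the collating-symbol form '[.-.]' is assumed to be supported by the regex library behind grep, which is not true of every implementation, so that call is guarded.

```shell
#!/bin/sh
# Compare two spellings of "neither a digit nor a hyphen".
# sample is a hypothetical helper; its lines are invented test data.
LC_ALL=C
export LC_ALL

sample() {
    printf '%s\n' a 5 - 42 -7 x-1
}

# Hyphen placed last in the bracket expression: it is taken literally.
# This selects the lines "a" and "x-1", the only ones containing a
# character that is neither a digit nor a hyphen.
sample | grep -e '[^0-9-]'

# Hyphen written as a collating symbol. Where supported, this should select
# the same lines; some regex libraries reject collating symbols outright.
sample | grep -e '[^0-9[.-.]]' || echo 'collating symbol form not usable here'
```

Putting the hyphen last is the more portable spelling, since it does not rely on collating-symbol support.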
Severity set to 'wishlist' from 'normal'
Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 30 May 2015 20:05:04 GMT)
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Fri, 22 Apr 2022 02:10:02 GMT)
Notification sent to arnold <at> skeeve.com: bug acknowledged by developer. (Fri, 22 Apr 2022 02:10:03 GMT)
Message #15 received at 20657-done <at> debbugs.gnu.org (full text, mbox):
On 4/21/22 00:57, Arnold Robbins wrote:
> As far as my testing indicates, dfa.c doesn't need a patch, it seems
> to accept "---" inside brackets for a single minus.
Yes, a brief perusal of the dfa.c source code suggests you're right.
Thanks for looking into this. I tend to agree with you that POSIX is not
likely to outlaw this extension.
> If there are no objections, can we get this into Gnulib?
Although the basic idea looks good, I see a few places where the patch
can be improved.
* The two calls to re_string_peek_byte might go past the end of the
pattern (a subscript violation). This is possible because the pattern is
not necessarily null-terminated.
* The two calls to re_string_fetch_byte can be simplified into a single
call to re_string_skip_bytes.
* No need to assign to token->opr.c, as it already has the correct value.
* Can fall through to the default case to save a bit of duplicate code.
* glibc still uses comments /* like this */ for style reasons, and we
should stick to that.
I wrote a patch with these improvements in mind and installed it into
Gnulib (see attached); hope it works for Gawk too.
[0001-regex-match-.-.-like-V7-grep.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#20657; Package grep. (Sun, 24 Apr 2022 13:22:01 GMT)
Message #18 received at 20657-done <at> debbugs.gnu.org (full text, mbox):
Hi Paul.
Thanks for this. The patch looks good. I will (eventually) merge it
into gawk instead of my change.
I plan to add a test to gawk; perhaps grep would benefit from one as well?
Thanks,
Arnold
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 4/21/22 00:57, Arnold Robbins wrote:
>
> > As far as my testing indicates, dfa.c doesn't need a patch, it seems
> > to accept "---" inside brackets for a single minus.
>
> Yes, a brief perusal of the dfa.c source code suggests you're right.
> Thanks for looking into this. I tend to agree with you that POSIX is not
> likely to outlaw this extension.
>
>
> > If there are no objections, can we get this into Gnulib?
>
> Although the basic idea looks good, I see a few places where the patch
> can be improved.
>
> * The two calls to re_string_peek_byte might go past the end of the
> pattern (a subscript violation). This is possible because the pattern is
> not necessarily null-terminated.
>
> * The two calls to re_string_fetch_byte can be simplified into a single
> call to re_string_skip_bytes.
>
> * No need to assign to token->opr.c, as it already has the correct value.
>
> * Can fall through to the default case to save a bit of duplicate code.
>
> * glibc still uses comments /* like this */ for style reasons, and we
> should stick to that.
>
> I wrote a patch with these improvements in mind and installed it into
> Gnulib (see attached); hope it works for Gawk too.
Information forwarded to bug-grep <at> gnu.org: bug#20657; Package grep. (Sun, 24 Apr 2022 19:08:02 GMT)
Message #21 received at 20657-done <at> debbugs.gnu.org (full text, mbox):
On 4/24/22 06:21, arnold <at> skeeve.com wrote:
> I plan to add a test to gawk; perhaps grep would benefit from one as well?
That'd need more than just a test, as we'd need to also modify regex.m4
to arrange to replace any system regex that has this incompatibility
with gnulib regex. And we'd need to document the extension since we
shouldn't test undocumented features. Although such work could be done,
I expect it'd be a more productive use of our limited time to get this
extension into glibc first. I'll add that to my (long) list of things to do.
Information forwarded to bug-grep <at> gnu.org: bug#20657; Package grep. (Mon, 25 Apr 2022 04:52:01 GMT)
Message #24 received at 20657-done <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 4/24/22 06:21, arnold <at> skeeve.com wrote:
> > I plan to add a test to gawk; perhaps grep would benefit from one as well?
>
> That'd need more than just a test, as we'd need to also modify regex.m4
> to arrange to replace any system regex that has this incompatibility
> with gnulib regex. And we'd need to document the extension since we
> shouldn't test undocumented features. Although such work could be done,
> I expect it'd be a more productive use of our limited time to get this
> extension into glibc first. I'll add that to my (long) list of things to do.
OK - I agree that getting this into glibc is higher priority.
Thanks,
Arnold
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 23 May 2022 11:24:06 GMT)
This bug report was last modified 1 year and 310 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.