GNU bug report logs - #20657
Traditional range expression not accepted in regex/dfa

Package: grep

Reported by: arnold <at> skeeve.com

Date: Tue, 26 May 2015 02:43:02 UTC

Severity: wishlist

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20657 in the body.
You can then email your comments to 20657 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#20657; Package grep. (Tue, 26 May 2015 02:43:02 GMT)

Acknowledgement sent to arnold <at> skeeve.com:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 26 May 2015 02:43:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: arnold <at> skeeve.com
To: bug-grep <at> gnu.org
Subject: Traditional range expression not accepted in regex/dfa
Date: Tue, 26 May 2015 05:42:19 +0300

Hi.

I received a bug report for gawk by private email that a regexp of
this form: '[^0-9---]' wasn't accepted.  The bugaboo here is the "---"; it's
a range expression consisting of minus through minus, and apparently long
ago was how one got a minus into a bracket expression.

This can be seen in current grep also:

	$ ./src/grep --version
	./src/grep (GNU grep) 2.21
	Copyright (C) 2014 Free Software Foundation, Inc.
	...

	$ ./src/grep '[^0-9---]' /dev/null
	./src/grep: Invalid range end

The underlying regex and, I believe, dfa routines don't accept this.
Fixing either of them is beyond my skill range, so I thought I'd
pass this one upstream to you folks.

Thanks!

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#20657; Package grep. (Tue, 26 May 2015 06:54:02 GMT)

Message #8 received at 20657 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com, 20657 <at> debbugs.gnu.org
Subject: Re: bug#20657: Traditional range expression not accepted in regex/dfa
Date: Mon, 25 May 2015 23:53:31 -0700

arnold <at> skeeve.com wrote:

> The bugaboo here is the "---"; it's
> a range expression consisting of minus through minus, and apparently long
> ago was how one got a minus into a bracket expression.

Actually, long ago expressions like '[^0-9-]' worked just as they do now, and it 
wasn't ever necessary to use trailing "---".  That being said, it is true that 
in 7th Edition Unix '[^0-9---]' meant the same thing as '[^0-9-]', so in that 
sense we have an incompatibility with 7th Edition Unix here.

> 	$ ./src/grep '[^0-9---]' /dev/null
> 	./src/grep: Invalid range end
>
> The underlying regex and, I believe, dfa routines don't accept this.

Yes, that's correct.  It's not a bug, though, as the regexp is ambiguous and 
does not conform to POSIX, which says the following about RE bracket 
expressions: "To use a <hyphen> as the starting range point, it shall either 
come first in the bracket expression or be specified as a collating symbol; for 
example, "[][.-.]-0]", which matches either a <right-square-bracket> or any 
character or collating element that collates between <hyphen> and 0, inclusive." 
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05> 
In your correspondent's example, the hyphen is a starting range point but is 
neither first in the bracket expression nor is specified as a collating symbol, 
so the regexp doesn't conform to POSIX.

Even though it's not a bug I suppose it wouldn't hurt to make the GNU matchers 
compatible with 7th Edition Unix here, if someone really wants to take that task 
on; it's not urgent, though.
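
The two placements that the quoted POSIX text allows without a collating symbol (hyphen first, or hyphen last as in '[^0-9-]') can be checked with a small sketch against the standard regcomp/regexec interface. The helper name bre_matches is hypothetical and is not part of grep, gawk, or Gnulib; the collating-symbol spelling "[.-.]" is left out because support for it may differ between C libraries.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for the POSIX
   basic regular expression PATTERN, 0 if it does not, and -1 if
   regcomp rejects PATTERN.  */
int
bre_matches (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

For example, bre_matches ("[^0-9-]", "a") should return 1, while the same pattern applied to "5" or to "-" should return 0, since both the digits and the hyphen are excluded by the negated bracket expression.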




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 30 May 2015 20:05:04 GMT)

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 22 Apr 2022 02:10:02 GMT)

Notification sent to arnold <at> skeeve.com:
bug acknowledged by developer. (Fri, 22 Apr 2022 02:10:03 GMT)

Message #15 received at 20657-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Arnold Robbins <arnold <at> skeeve.com>
Cc: bug-gnulib <at> gnu.org, 20657-done <at> debbugs.gnu.org, beebe <at> math.utah.edu
Subject: Re: Accepting [xyz---abc] - three minus signs to mean one
Date: Thu, 21 Apr 2022 19:08:55 -0700
On 4/21/22 00:57, Arnold Robbins wrote:

> As far as my testing indicates, dfa.c doesn't need a patch, it seems
> to accept "---" inside brackets for a single minus.

Yes, a brief perusal of the dfa.c source code suggests you're right. 
Thanks for looking into this. I tend to agree with you that POSIX is not 
likely to outlaw this extension.


> If there are no objections, can we get this into Gnulib?

Although the basic idea looks good, I see a few places where the patch 
can be improved.

* The two calls to re_string_peek_byte might go past the end of the 
pattern (a subscript violation). This is possible because the pattern is 
not necessarily null-terminated.

* The two calls to re_string_fetch_byte can be simplified into a single 
call to re_string_skip_bytes.

* No need to assign to token->opr.c, as it already has the correct value.

* Can fall through to the default case to save a bit of duplicate code.

* glibc still uses comments /* like this */ for style reasons, and we 
should stick to that.

I wrote a patch with these improvements in mind and installed it into 
Gnulib (see attached); hope it works for Gawk too.
[0001-regex-match-.-.-like-V7-grep.patch (text/x-patch, attachment)]
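
The first point in the list above, that a lookahead must not read past the end of a pattern that is not necessarily null-terminated, can be illustrated with a standalone sketch. This is not the installed Gnulib patch: the function at_triple_minus and its parameters are hypothetical, and it works on a plain byte buffer rather than on the regex library's internal string type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not the Gnulib code: report whether the LEN-byte
   pattern PAT has "---" starting at index IDX.  PAT need not be
   null-terminated, so the remaining length is checked before any byte
   is read.  */
bool
at_triple_minus (const char *pat, size_t len, size_t idx)
{
  assert (pat != NULL);
  if (idx > len || len - idx < 3)
    return false;
  return pat[idx] == '-' && pat[idx + 1] == '-' && pat[idx + 2] == '-';
}
```

With the nine-byte pattern "[^0-9---]", index 5 is the start of the three minus signs, so at_triple_minus (pat, 9, 5) is true. If only the first seven bytes are passed as the pattern, the same index yields false without touching bytes beyond the stated length.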

Information forwarded to bug-grep <at> gnu.org:
bug#20657; Package grep. (Sun, 24 Apr 2022 13:22:01 GMT)

Message #18 received at 20657-done <at> debbugs.gnu.org:

From: arnold <at> skeeve.com
To: eggert <at> cs.ucla.edu, arnold <at> skeeve.com
Cc: bug-gnulib <at> gnu.org, 20657-done <at> debbugs.gnu.org, beebe <at> math.utah.edu
Subject: Re: Accepting [xyz---abc] - three minus signs to mean one
Date: Sun, 24 Apr 2022 07:21:06 -0600

Hi Paul.

Thanks for this. The patch looks good. I will (eventually) merge it
into gawk instead of my change.

I plan to add a test to gawk; perhaps grep would benefit from one as well?

Thanks,

Arnold





Information forwarded to bug-grep <at> gnu.org:
bug#20657; Package grep. (Sun, 24 Apr 2022 19:08:02 GMT)

Message #21 received at 20657-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com
Cc: bug-gnulib <at> gnu.org, 20657-done <at> debbugs.gnu.org, beebe <at> math.utah.edu
Subject: Re: bug#20657: Accepting [xyz---abc] - three minus signs to mean one
Date: Sun, 24 Apr 2022 12:07:40 -0700

On 4/24/22 06:21, arnold <at> skeeve.com wrote:
> I plan to add a test to gawk; perhaps grep would benefit from one as well?

That'd need more than just a test, as we'd need to also modify regex.m4 
to arrange to replace any system regex that has this incompatibility 
with gnulib regex. And we'd need to document the extension since we 
shouldn't test undocumented features. Although such work could be done, 
I expect it'd be a more productive use of our limited time to get this 
extension into glibc first. I'll add that to my (long) list of things to do.
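
As a rough illustration of the kind of run-time probe such a check might rely on, the following sketch asks the system regcomp whether it accepts "---" inside a bracket expression and, if so, whether the result behaves like a single minus sign. It is an assumption about how a probe could look, not the contents of regex.m4, and the function name regex_accepts_triple_minus is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical probe: return 1 if the system regcomp accepts
   "[^0-9---]" and treats it like "[^0-9-]", 0 otherwise.  */
int
regex_accepts_triple_minus (void)
{
  regex_t re;
  int ok;

  if (regcomp (&re, "[^0-9---]", REG_NOSUB) != 0)
    return 0;
  /* Accepted: the set must exclude '-' and digits but allow 'a'.  */
  ok = (regexec (&re, "-", 0, NULL, 0) == REG_NOMATCH
        && regexec (&re, "7", 0, NULL, 0) == REG_NOMATCH
        && regexec (&re, "a", 0, NULL, 0) == 0);
  regfree (&re);
  return ok;
}
```

The result depends on the C library in use, so no particular return value is assumed here; a library without the extension would be expected to fail in regcomp, as in the "Invalid range end" report at the top of this thread.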




Information forwarded to bug-grep <at> gnu.org:
bug#20657; Package grep. (Mon, 25 Apr 2022 04:52:01 GMT)

Message #24 received at 20657-done <at> debbugs.gnu.org:

From: arnold <at> skeeve.com
To: eggert <at> cs.ucla.edu, arnold <at> skeeve.com
Cc: bug-gnulib <at> gnu.org, 20657-done <at> debbugs.gnu.org, beebe <at> math.utah.edu
Subject: Re: bug#20657: Accepting [xyz---abc] - three minus signs to mean one
Date: Sun, 24 Apr 2022 22:51:13 -0600

Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 4/24/22 06:21, arnold <at> skeeve.com wrote:
> > I plan to add a test to gawk; perhaps grep would benefit from one as well?
>
> That'd need more than just a test, as we'd need to also modify regex.m4 
> to arrange to replace any system regex that has this incompatibility 
> with gnulib regex. And we'd need to document the extension since we 
> shouldn't test undocumented features. Although such work could be done, 
> I expect it'd be a more productive use of our limited time to get this 
> extension into glibc first. I'll add that to my (long) list of things to do.

OK - I agree that getting this into glibc is higher priority.

Thanks,

Arnold




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 23 May 2022 11:24:06 GMT)

This bug report was last modified 1 year and 310 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.