GNU bug report logs - #25048
--with-included-regex vs. e-acute piped into LC_ALL=fr_FR.iso88591 grep '[d-f]'


Package: grep;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 28 Nov 2016 04:58:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 25048 AT debbugs.gnu.org.



Report forwarded to bug-grep <at> gnu.org:
bug#25048; Package grep. (Mon, 28 Nov 2016 04:58:02 GMT)

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 28 Nov 2016 04:58:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Subject: --with-included-regex vs. e-acute piped into LC_ALL=fr_FR.iso88591
 grep '[d-f]'
Date: Sun, 27 Nov 2016 20:57:23 -0800
When grep is configured --with-included-regex, the following command
fails to print the expected match:

   printf '\351\n' |LC_ALL=fr_FR.iso88591 src/grep '[d-f]'

You wouldn't notice on glibc-based systems, since the default there is
to use the glibc-supplied regex code, which does make grep detect the
match.

However, on other systems (I noticed on OS X), configuration machinery
detects that we have to resort to the included regex matcher, and
there, the default build results in a grep binary that fails the new
unibyte-bracket-expr test.
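
A minimal sketch for checking which behavior a given grep build exhibits. The helper name `eacute_in_range` is hypothetical, and the sketch assumes the fr_FR.iso88591 locale is installed (`locale -a` lists the available ones). If it is not, the locale setup typically fails and grep behaves as in the C locale, so "not matched" then says nothing about the regex implementation.

```shell
# Hypothetical helper, not part of grep: succeeds (exit 0) when the grep
# found on PATH, run in locale $1, matches byte 0xE9 (octal 351, e-acute
# in ISO-8859-1) against the range [d-f].
eacute_in_range() {
  printf '\351\n' | LC_ALL="$1" grep -q '[d-f]'
}

# In the C locale ranges follow byte values, so [d-f] covers only d, e, f.
if eacute_in_range C; then
  echo "C: e-acute matched"
else
  echo "C: e-acute not matched"
fi

# The result here depends on the regex implementation the binary uses.
if eacute_in_range fr_FR.iso88591; then
  echo "fr_FR.iso88591: e-acute matched"
else
  echo "fr_FR.iso88591: e-acute not matched"
fi
```

To test a freshly built binary rather than the installed one, replace `grep` in the helper with the path to it, such as `src/grep` in the command above.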

Why? Because the included regcomp.c has two code paths: one under #if
_LIBC, which is collating-sequence aware, and another that ignores
collating sequences. The former can be used only when building glibc
itself, and is the path we require in order to handle this case.  The
latter is what we get when compiling anywhere else.

Since it's always been this way, I don't plan to attempt a work-around
before the next release, and instead will probably arrange for that
test to be skipped when grep is built with the included regex.

Other ideas welcome,

Jim




Message #8 received at 25048 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Jim Meyering <jim <at> meyering.net>, 25048 <at> debbugs.gnu.org
Subject: Re: bug#25048: --with-included-regex vs. e-acute piped into
 LC_ALL=fr_FR.iso88591 grep '[d-f]'
Date: Mon, 28 Nov 2016 10:53:04 -0600
On 11/27/2016 10:57 PM, Jim Meyering wrote:
> When grep is configured --with-included-regex, the following command
> fails to print the expected match:
> 
>    printf '\351\n' |LC_ALL=fr_FR.iso88591 src/grep '[d-f]'

But the problem is that POSIX does NOT define what the "expected match"
should be. The very fact that you're using a non-C locale but passing a
range means that you have unspecified behavior per POSIX.  Some regex
engines treat 'e' and 'e-acute' as both being part of the range, others
treat only 'e' as being part of the range.  Expecting any particular
behavior is a bug, unless you know for sure that you are using GNU's
"rational range behavior" which explicitly treats ranges in ALL locales
the same as if they were in the C locale (that is, e-acute is never part
of the [d-f] range under rational range behavior).

> 
> Since it's always been this way, I don't plan to attempt a work-around
> before the next release, and instead will probably arrange for that
> test to be skipped when grep is built with the included regex.
> 
> Other ideas welcome,

We SHOULD be adjusting more and more GNU tools to honor rational range
behavior, at least as an option, even if that means that e-acute can
never be matched to [d-f].
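
For scripts that need one predictable answer regardless of how a range is interpreted outside the C locale, two approaches are possible: pin the locale to C for the match, or list the intended characters instead of writing a range. The following is a minimal sketch of both, assuming any POSIX grep; the function names are hypothetical.

```shell
# Hypothetical helpers, not part of grep.

# Pin the locale: in the C locale [d-f] is exactly the bytes d, e, f.
match_range_c() {
  LC_ALL=C grep '[d-f]'
}

# Enumerate instead of using a range: the bracket expression then does
# not depend on how the locale orders characters between d and f.
match_enumerated() {
  grep '[def]'
}

# Both print the lines d, e and f for this ASCII-only input.
printf 'c\nd\ne\nf\ng\n' | match_range_c
printf 'c\nd\ne\nf\ng\n' | match_enumerated
```

Neither form matches e-acute. A pattern that should accept it has to name it explicitly.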

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Message #11 received at 25048 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, Jim Meyering <jim <at> meyering.net>,
 25048 <at> debbugs.gnu.org
Subject: Re: bug#25048: --with-included-regex vs. e-acute piped into
 LC_ALL=fr_FR.iso88591 grep '[d-f]'
Date: Mon, 28 Nov 2016 09:13:02 -0800
On 11/28/2016 08:53 AM, Eric Blake wrote:
> We SHOULD be adjusting more and more GNU tools to honor rational range
> behavior

Yes, sorry, I forgot about that possibility when writing that test. I 
reverted the change to grep that added the test; this should fix the 
problem.





Message #14 received at 25048 <at> debbugs.gnu.org:

From: arnold <at> skeeve.com
To: jim <at> meyering.net, eblake <at> redhat.com, 25048 <at> debbugs.gnu.org
Subject: Re: bug#25048: --with-included-regex vs. e-acute piped into
 LC_ALL=fr_FR.iso88591 grep '[d-f]'
Date: Mon, 28 Nov 2016 11:48:11 -0700
> We SHOULD be adjusting more and more GNU tools to honor rational range
> behavior,

Hear, hear!  (Or "+1" in 21st Century English.)

The official term, coined by Karl Berry and as documented in the gawk
manual, is "Rational Range Interpretation".  :-) :-)

> at least as an option, even if that means that e-acute can
> never be matched to [d-f].

Now, if we could get GLIBC to move to that, we'd have something.

I've tried to submit patches in the past that weren't accepted,
but maybe it's worth trying again.

At least gawk and gnulib-based programs generally do so.

Arnold




Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org (Sun, 18 Dec 2016 21:40:02 GMT).

