GNU bug report logs - #33837
Unexpected result for regex with non-ascii range

Package: grep

Reported by: Reinis Danne <rei4dan <at> gmail.com>

Date: Sat, 22 Dec 2018 21:34:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 33837 in the body.
You can then email your comments to 33837 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-grep <at> gnu.org:
bug#33837; Package grep. (Sat, 22 Dec 2018 21:34:02 GMT)

Acknowledgement sent to Reinis Danne <rei4dan <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 22 Dec 2018 21:34:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Reinis Danne <rei4dan <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Unexpected result for regex with non-ascii range
Date: Sat, 22 Dec 2018 21:43:46 +0200

Hi!

grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation
of yY in the lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774

However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ |
    LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ |
    LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ

For the uppercase range the result is completely bogus, but for the lowercase
range it seems that accented uppercase letters are interleaved with the
lowercase ones.

I would expect all letters to have their uppercase variants de-interleaved here.

I don't know whether grep alters the collation rules or whether glibc (2.28) does.
strxfrm() gives me this result:
Using LC_COLLATE=lv_LV.UTF-8
char    strxfrm
i    c2b7010201020101e29b96
I    c2b7010201070101e2afb7
ī    c2b70102140102020101e29bb7
Ī    c2b70102140107020101e2b096
y    c2b701030102
Y    c2b701030107
j    c382010201020101e29c96
J    c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char    strxfrm
i    6b
I    4b
ī    c4ad
Ī    c4ac
y    7b
Y    5b
j    6c
J    4c


Reinis
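
The report does not say how the strxfrm() table above was produced. As a
rough illustration only, a table of that shape could be generated with
Python's locale module. The helper names strxfrm_hex and print_table are
made up for this sketch, and printing the key as hex of its UTF-8 encoding
is an assumption, so the values are not guaranteed to match the ones in the
report. A locale that is not installed is reported as None.

```python
import locale


def strxfrm_hex(ch, locale_name):
    """Return the strxfrm() key of ch as hex of its UTF-8 bytes.

    Hypothetical helper: returns None when locale_name cannot be selected
    on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        key = locale.strxfrm(ch)
        # Assumption: show the transformed key as UTF-8 bytes in hex.
        return key.encode("utf-8", errors="surrogatepass").hex()
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous locale


def print_table(chars, locale_name):
    """Print one 'char / strxfrm' table for the given collation locale."""
    print(f"Using LC_COLLATE={locale_name}")
    print("char\tstrxfrm")
    for ch in chars:
        print(f"{ch}\t{strxfrm_hex(ch, locale_name)}")


if __name__ == "__main__":
    for name in ("lv_LV.UTF-8", "C.UTF-8"):
        print_table("iIīĪyYjJ", name)
```

Comparing two such keys byte by byte gives the same order as strcoll() on
the original characters, which is why the table is useful for inspecting a
locale's collation.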

Information forwarded to bug-grep <at> gnu.org:
bug#33837; Package grep. (Sun, 23 Dec 2018 20:19:01 GMT)

Message #8 received at 33837 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: rei4dan <at> gmail.com
Cc: 33837 <at> debbugs.gnu.org
Subject: Re: bug#33837: Unexpected result for regex with non-ascii range
Date: Sun, 23 Dec 2018 12:17:52 -0800

tags 33873 notabug
close 33873
stop

On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4dan <at> gmail.com> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char    strxfrm
> i    c2b7010201020101e29b96
> I    c2b7010201070101e2afb7
...

Thanks for the report. However, using a multi-byte character as a range
endpoint elicits what the standards documents call "unspecified behavior".

Quoting grep's own manual,

> Within a bracket expression, a "range expression" consists of two characters separated by a hyphen.  It matches any single character that sorts between the two characters, inclusive.  In the default C locale, the sorting sequence is the native character order; for example, '[a-d]' is equivalent to '[abcd]'.  In other locales, the sorting sequence is not specified, and '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail to match any character, or the set of characters that it matches might even be erratic.  To obtain the traditional interpretation of bracket expressions, you can use the 'C' locale by setting the 'LC_ALL' environment variable to the value 'C'.

For the record, the POSIX rationale says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html

> Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include such characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges have unspecified behavior.

I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.

You're welcome to continue the discussion here.
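
Since the manual passage quoted above leaves range expressions unspecified
outside the C locale, one way to sidestep the question is to list the
wanted characters explicitly instead of writing a range. The following is a
minimal sketch of that idea in Python, whose re module compares str
patterns independently of the process locale. It is not something proposed
in this thread, and the names SAMPLE, lowercase and explicit_class are
illustrative only.

```python
import re
import unicodedata

# The test string from the report, normalized so that every accented
# letter is a single precomposed code point.
SAMPLE = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)

# Collect the lowercase letters that actually occur in the sample and put
# them into a bracket expression one by one, with no range involved.
lowercase = "".join(ch for ch in SAMPLE if ch.islower())
explicit_class = "[" + re.escape(lowercase) + "]+"

# Each lowercase letter in the sample is followed by its uppercase
# partner, so every match is a single lowercase letter.
runs = re.findall(explicit_class, SAMPLE)
print(runs)
```

With an explicit set there is no ordering to interpret, so the result does
not depend on any collation rules.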

Information forwarded to bug-grep <at> gnu.org:
bug#33837; Package grep. (Sun, 23 Dec 2018 21:07:02 GMT)

Message #11 received at 33837 <at> debbugs.gnu.org:

From: Reinis Danne <rei4dan <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 33837 <at> debbugs.gnu.org
Subject: Re: bug#33837: Unexpected result for regex with non-ascii range
Date: Sun, 23 Dec 2018 23:06:40 +0200

On Sun, Dec 23, 2018 at 22:18, Jim Meyering
(<jim <at> meyering.net>) wrote:
>
> tags 33873 notabug
> close 33873
> stop
>
> On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <rei4dan <at> gmail.com> wrote:
> > grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> > of yY for lv_LV.UTF-8 locale (by implementing rational range
> > interpretation?) [1].
> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
> >
> > However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> > | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> > aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> > Ž
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> > | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> > a
> > āĀb
> > c
> > čČd
> ...
> >
> > For the uppercase the result is completely bogus, but for the lowercase range
> > it seems that accented uppercase letters are interleaved with the
> > lowercase ones.
> >
> > I would expect all letters to have their uppercase variants de-interleaved here.
> >
> > I don't know if grep alters the collation rules or it is done by glibc (2.28).
> > strxfrm() gives me this result:
> > Using LC_COLLATE=lv_LV.UTF-8
> > char    strxfrm
> > i    c2b7010201020101e29b96
> > I    c2b7010201070101e2afb7
> ...
>
> Thanks for the report. However, ...
> Using a multi-byte character as a range endpoint elicits what the
> standards documents call "unspecified behavior".
>
> Quoting grep's own manual,
>
> > Within a bracket expression, a "range expression" consists of two characters separated by a hyphen.  It matches any single character that sorts between the two characters, inclusive.  In the default C locale, the sorting sequence is the native character order; for example, '[a-d]' is equivalent to '[abcd]'.  In other locales, the sorting sequence is not specified, and '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail to match any character, or the set of characters that it matches might even be erratic.  To obtain the traditional interpretation of bracket expressions, you can use the 'C' locale by setting the 'LC_ALL' environment variable to the value 'C'.
>
> For the record, POSIX says this:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:
>
> > Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include such characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges have unspecified behavior.
>
> I am marking the auto-created issue as "not-a-bug", and can't even
> (reasonably) label it as "wishlist", because allowing what your usage
> implies is fundamentally contradictory.
>
> You're welcome to continue the discussion here.

Thank you for the response.

I had read that document before. I didn't realize that sorting order
and collation order are two different things: I assumed that
alphabetic sorting implied collation, whereas the sorting order the
manual talks about refers to comparing the numerical values of code
points.
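
The code point reading described in this message can be checked against the
outputs in the original report. The sketch below does not call grep. It only
emulates a range as "every character whose code point lies between the two
endpoints, inclusive" and splits the test string into maximal runs, roughly
what grep -Eo prints for a non-empty match. The function name
codepoint_runs is made up for this illustration, and the assumption that
grep 3.3 behaves this way is inferred from the thread, not from grep's
source.

```python
import unicodedata
from itertools import groupby

# The test string from the report, normalized to precomposed code points.
SAMPLE = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)


def codepoint_runs(text, lo, hi):
    """Return maximal runs of characters with ord(lo) <= code point <= ord(hi)."""
    lo_cp, hi_cp = ord(lo), ord(hi)
    runs = []
    for inside, group in groupby(text, key=lambda ch: lo_cp <= ord(ch) <= hi_cp):
        if inside:
            runs.append("".join(group))
    return runs


# [A-Ž]: U+0041..U+017D covers a-z and every accented letter in the sample
# except ž (U+017E), so only that one character splits the string.
upper = codepoint_runs(SAMPLE, "A", "\u017d")

# [a-ž]: U+0061..U+017E excludes the ASCII capitals A-Z (U+0041..U+005A)
# but includes the accented capitals, which therefore stay in the runs.
lower = codepoint_runs(SAMPLE, "a", "\u017e")

print(upper)
print(lower)
```

Under these assumptions the first list has two entries and the second has
27, which lines up with the two grep outputs in the report: plain capitals
act as separators for [a-ž], while accented capitals fall inside the range.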

Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 02 Jan 2020 09:01:01 GMT)

Bug closed, send any further explanations to 33837 <at> debbugs.gnu.org and Reinis Danne <rei4dan <at> gmail.com>. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 02 Jan 2020 09:01:01 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 30 Jan 2020 12:24:05 GMT)

This bug report was last modified 5 years and 171 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.