GNU logs - #11621, boring messages

Message sent to bug-coreutils@HIDDEN:

Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Resent-From: Linda Walsh <coreutils@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Sun, 03 Jun 2012 22:16:02 +0000
Resent-Message-ID: <handler.11621.B.133876173218443 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
To: 11621 <at> debbugs.gnu.org
Message-ID: <4FCBE17F.40102@HIDDEN>
Date: Sun, 03 Jun 2012 15:13:19 -0700
From: Linda Walsh <coreutils@HIDDEN>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US;
	rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24
	Mnenhy/0.7.6.666
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org

Within the past few years, use of ranges in REs has become
unreliable due to some locale changes sorting their native character
sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
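
As a minimal sketch of the range problem (not part of the original report): in the 'C' locale the range [a-z] covers only the lowercase ASCII letters, whereas under the interleaved a<A<b<B ordering described above it may also admit uppercase letters, depending on the locale and library version. The function names below are made up for illustration; [[:lower:]] is the POSIX character class, which does not depend on range order.

```shell
# Hypothetical helpers, for illustration only.

# Keep lines containing a lowercase ASCII letter, using a range.
# LC_ALL=C pins the range to byte order, so [a-z] is exactly a..z.
match_range_c() {
    LC_ALL=C grep '[a-z]'
}

# Same idea with a POSIX character class instead of a range.
match_lower_class() {
    LC_ALL=C grep '[[:lower:]]'
}

printf '%s\n' a B z | match_range_c
printf '%s\n' a B z | match_lower_class
```

Both pipelines drop the line "B" here; running the first one without LC_ALL=C is the case the report is concerned about.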

Additionally, many distros have switched to UTF-8, resulting in
localizations like en_GB.UTF-8, en_US.UTF-8, etc.

There seems to be a problem in that when a user has set their system to
use Unicode, it no longer uses the locale-specific character set
(iso-8859-x, or others).

In Unicode, it is recommended that upper case be uniformly sorted
below lower case (section 6.6, http://www.unicode.org/reports/tr10/).

A chart, including accent variations is at

 http://unicode.org/charts/case/chart_Latin.htm.

Temporarily ignoring accents, only talking about lower and upper
case letters, you will note that the sorting order of A=41, B=42, C=43,
while the lower case letters from 'a', have weights a=61, b=62, c=63.

This uniformly puts all lower case letters "after" any upper case letters.

Thus -- I am asserting, that any computer using a locale for country
preferences, BUT is also using a unicode character set (e.g. UTF-8),
should return sorted results as specified by the character set.

I.e. the utility 'sort' (and any programs that use the collation/sorting
order specified in the core-utils libs) should return A-Z < a-z.

This is currently not the case and is leading to erroneous results
in programs written before locales were considered.  The thing is --
in many cases, within some short period of locales being implemented,
many or most distros also switched to UTF-8.

Unfortunately its collation order has not been respected.

I would assert this is a serious bug that should be addressed ASAP...

Thanks,
Linda W.

Message sent:

Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.428 (Entity 5.428)
Content-Type: text/plain; charset=utf-8
X-Loop: help-debbugs@HIDDEN
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: Linda Walsh <coreutils@HIDDEN>
Subject: bug#11621: Acknowledgement (questionable locale sorting order
 (especially as related to char ranges in REs))
Message-ID: <handler.11621.B.133876173218443.ack <at> debbugs.gnu.org>
References: <4FCBE17F.40102@HIDDEN>
X-Gnu-PR-Message: ack 11621
X-Gnu-PR-Package: coreutils
Reply-To: 11621 <at> debbugs.gnu.org
Date: Sun, 03 Jun 2012 22:16:03 +0000

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-coreutils@HIDDEN

If you wish to submit further information on this problem, please
send it to 11621 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

-- 
11621: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=11621
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems

Message sent to bug-coreutils@HIDDEN:

Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Resent-From: =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Sun, 03 Jun 2012 23:00:02 +0000
Resent-Message-ID: <handler.11621.B11621.133876434822279 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
To: Linda Walsh <coreutils@HIDDEN>
Cc: 11621 <at> debbugs.gnu.org
Message-ID: <4FCBEBC0.90200@HIDDEN>
Date: Sun, 03 Jun 2012 23:57:04 +0100
From: =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:6.0) Gecko/20110816 Thunderbird/6.0
MIME-Version: 1.0
References: <4FCBE17F.40102@HIDDEN>
In-Reply-To: <4FCBE17F.40102@HIDDEN>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org

On 06/03/2012 11:13 PM, Linda Walsh wrote:
> Within in the past few years, use of ranges in RE's has become
> unreliable due to some locale changes sorting their native character
> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
> 
> Additionally many distro's have switched to UTF-8 resulting in
> localizations like en_GB.UTF-8, en_US.UTF-8, etc...
> 
> There seems to be a problem in when a user has set their system to use
> Unicode, it is no longer using the locale specific character set (iso-8859-x,
> or others).

It's not specific to "unicode". Sorting in a iso-8859-1 charset
results in locale ordering:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

> In Unicode, it is recommended that upper case be uniformly sorted
> below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
> 
> A chart, including accent variations is at
> 
> http://unicode.org/charts/case/chart_Latin.htm.

http://unicode.org/charts/case/chart_Latin.html

> Temporarily ignoring accents, only talking about lower and upper
> case letters, you will note that the sorting order of A=41, B=42, C=43,
> while the lower case letters from 'a', have weights a=61, b=62, c=63.
> 
> This uniformly puts all lower case letters "after" any upper case letters.
> 
> Thus -- I am asserting, that any computer using a locale for country
> preferences, BUT is also using a unicode character set (e.g. UTF-8),
> should return sorted results as specified by the character set.
> 
> I.e. the utility 'sort' (and any programs that use the collation/sorting
> order specified in the core-utils libs) should return A-Z < a-z.

Well case comparison is a complicated area.

For the special case of discounting accented chars etc.
you can use an attribute of the well designed UTF-8.
Enabling traditional byte comparison on (normalized) UTF-8 data
will result in data sorted in Unicode code point order:

$ printf "%s\n" A b a á | LC_ALL=C sort
A
a
b
á
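
A small sketch of the property described above, for illustration only: for valid UTF-8, comparing raw bytes gives the same order as comparing code points, so a byte-wise sort places ASCII first and multi-byte characters after it in code point order. The helper name is hypothetical; the input is spelled with octal escapes so the example does not depend on how this text is encoded.

```shell
# Hypothetical helper: sort lines by raw byte value.
sort_bytewise() {
    LC_ALL=C sort
}

# \303\241 is the UTF-8 encoding of á (U+00E1) and
# \342\205\247 is the UTF-8 encoding of Ⅷ (U+2167).
printf 'b\n\342\205\247\nA\n\303\241\na\n' | sort_bytewise
```

The result is A, a, b, á, Ⅷ, i.e. ascending code points U+41, U+61, U+62, U+E1, U+2167.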

> This is currently not the case and is leading to erroneous results
> in programs written before locales were considered.  The thing is --
> in many cases, within some short period of locales being implemented,
> many or most distro's also switched to UTF-8.
> 
> Unfortunately it's collation order has not been respected.
> 
> I would assert this is a serious bug that should be addressed ASAP...

As for the question in the subject for handling ranges in REs,
there has been recent work in changing as you suggest:

http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105

cheers,
Pádraig.

Message sent to bug-coreutils@HIDDEN:

Subject: bug#11621: questionable locale sorting order (especially as related	to char ranges in REs)
Resent-From: "Linda A. Walsh" <lkml@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Mon, 04 Jun 2012 05:06:01 +0000
Resent-Message-ID: <handler.11621.B11621.133878634127558 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
To: =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
Cc: 11621 <at> debbugs.gnu.org
Message-ID: <4FCC41AA.3040306@HIDDEN>
Date: Sun, 03 Jun 2012 22:03:38 -0700
From: "Linda A. Walsh" <lkml@HIDDEN>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US;
	rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24
	Mnenhy/0.7.6.666
MIME-Version: 1.0
References: <4FCBE17F.40102@HIDDEN> <4FCBEBC0.90200@HIDDEN>
In-Reply-To: <4FCBEBC0.90200@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Precedence: list
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org



Pádraig Brady wrote:
> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>> Within in the past few years, use of ranges in RE's has become
>> unreliable due to some locale changes sorting their native character
>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>
>> There seems to be a problem in when a user has set their system to use
>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>> or others).
----
	To clarify my above statement:


    There seems to be a problem in that when a user has set their system to use
Unicode: It is no longer using the locale specific character set (iso-8859-x,
or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
order -- I don't know that the others do ('C' does, but I don't know about
other locale-specific character sets).


> It's not specific to "unicode". Sorting in a iso-8859-1 charset
> results in locale ordering:
----
	Can you cite a source specifying the sort/collation order of the
iso-8859-1 charset that would prove that it is not-conforming to the collation
specification for that charset?

	I.e. If there is no official source, then the order with that charset
is "undefined", and while it may not be desirable, returning a<A<b<B, would not
be "an error".




>> http://unicode.org/charts/case/chart_Latin.htm.
>
> http://unicode.org/charts/case/chart_Latin.html
---
	^^Correct^^ (typo)

>> Temporarily ignoring accents, only talking about lower and upper
>> case letters, ...
>
> Well case comparison is a complicated area.
----
	A bit, but it's mostly just wrong in the gnu library concerning unicode, and,
as you are pointing out -- the 'C' encoding as well.
The 'C' locale was the original charset used by the 'C' language -- only 8 bits
wide.

	So how can it sort characters beyond the lower 256?
This would seem to be meaningless and bogus output.


Is it?...   When the case comparison ordering is specified in a
standard, it makes it fairly clear that one is either compliant with the standard
or not.

	In this case, the Gnu sort/collation lib is not Unicode/UTF-8 compliant.

	What happens in other charsets may or may not be covered under some
other standard -- e.g. the 'C'/ascii ordering is specified.  But I don't know
if others have relevant standards or not.

>
> For the special case of discounting accented chars etc.
> you can use an attribute of the well designed UTF-8.
---
	This is not exactly the point -- the point is that the core sort
DOESN'T use that ordering.  That's the bug I am reporting.

	In reporting this, I'm trying to keep the argument 'simple' and focus on
the problem of widely used ranges in the first 256 code-points of
Unicode.

	Unicode gives a fairly extensive algorithm for handling accents,
but I didn't want to complicate the discussion by "going there".  Please
focus this bug on the lower 128 code points, as full unicode compliance
with the full collation algorithm that is specified is likely to be a
larger task.  HOWEVER, fixing the sorting/collation order of the lower
127 code points, is, comparatively a small task that conceivably could be
fixed in the next release.


> Enabling traditional byte comparison on (normalized) UTF-8 data
> will result in data sorted in Unicode code point order:
> A b a á => A a b á

But you are missing the point (as well as raising an interesting 'feature'(?bug?)).

How is it that 'C' collation collates characters that are outside the ascii range?

I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
So how does this work in the 'C' locale?  AND more importantly -- it SHOULD work
when charset is unicode (UTF-8)... and does not.  Test prog:
---------------
#!/bin/bash
set -m
# vals to test:
declare -a vals=( A a B b X x Y y Z z Ⅷ  Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ  ⅼ ⅲ )
COLLATE_ORDER=C

function isatty {
	local fd=${1:-1} ;
	0<&$fd tty -s
}

function ord {
   local nl="";
	isatty && nl="\n"
	printf "%d$nl" "'$1"
}

function background_print {
	readarray -t inp
	for ch in "${inp[@]}"; {
		printf "%s   (U+%x)\n" "$ch" "$(ord "$ch")"
	}
}


printf "%s\n" "${vals[@]}" |
		LC_COLLATE=$COLLATE_ORDER sort |
		background_print

------------------------------------

Note, that the above produces:

/tmp/stest
Ⅷ   (U+2167)
Ⅴ   (U+2164)
Ⅲ   (U+2162)
Ⅰ   (U+2160)
Ⅿ   (U+216f)
Ⅽ   (U+216d)
ⅶ   (U+2176)
ⅼ   (U+217c)
ⅲ   (U+2172)
a   (U+61)
A   (U+41)
b   (U+62)
B   (U+42)
x   (U+78)
X   (U+58)
y   (U+79)
Y   (U+59)
z   (U+7a)
Z   (U+5a)

NOT the output you showed...Seems there's a bug in the C collation order?

Changing collation order to UTF-8:

Same thing:
  /tmp/stest
Ⅷ   (U+2167)
Ⅴ   (U+2164)
Ⅲ   (U+2162)
Ⅰ   (U+2160)
Ⅿ   (U+216f)
Ⅽ   (U+216d)
ⅶ   (U+2176)
ⅼ   (U+217c)
ⅲ   (U+2172)
a   (U+61)
A   (U+41)
b   (U+62)
B   (U+42)
x   (U+78)
X   (U+58)
y   (U+79)
Y   (U+59)
z   (U+7a)
Z   (U+5a)


>> I would assert this is a serious bug that should be addressed ASAP...
>
> As for the question in the subject for handling ranges in REs,
> there has been recent work in changing as you suggest:
>
> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
----

	Recent?
The most recent posts on that thread look to be from June of last year.
I.e. a year ago.

I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
the gnu library doesn't follow it.

Major problem with so many progs relying on the lib!...

Message sent to bug-coreutils@HIDDEN:

X-Loop: help-debbugs@HIDDEN
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Resent-From: =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Mon, 04 Jun 2012 08:51:02 +0000
Resent-Message-ID: <handler.11621.B11621.133879985917929 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 11621
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: 
To: "Linda A. Walsh" <lkml@HIDDEN>
Cc: 11621 <at> debbugs.gnu.org
Received: via spool by 11621-submit <at> debbugs.gnu.org id=B11621.133879985917929
          (code B ref 11621); Mon, 04 Jun 2012 08:51:02 +0000
Received: (at 11621) by debbugs.gnu.org; 4 Jun 2012 08:50:59 +0000
Received: from localhost ([127.0.0.1]:56571 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1SbT07-0004f7-39
	for submit <at> debbugs.gnu.org; Mon, 04 Jun 2012 04:50:59 -0400
Received: from mail3.vodafone.ie ([213.233.128.45]:40696)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <P@HIDDEN>) id 1SbT03-0004et-9K
	for 11621 <at> debbugs.gnu.org; Mon, 04 Jun 2012 04:50:57 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApMBACV1zE9tTjL9/2dsb2JhbAANOIVOsXYBAQEDAQECIA8BRgULCQINCwICBRYLAgIJAwIBAgEWLwYNAQcBAQWHfRCkH5F5gSOJboR+gRIDknWDNYRjjGWBVSM
Received: from unknown (HELO [192.168.1.79]) ([109.78.50.253])
	by mail3.vodafone.ie with ESMTP; 04 Jun 2012 09:48:53 +0100
Message-ID: <4FCC7674.1090705@HIDDEN>
Date: Mon, 04 Jun 2012 09:48:52 +0100
From: =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:6.0) Gecko/20110816 Thunderbird/6.0
MIME-Version: 1.0
References: <4FCBE17F.40102@HIDDEN> <4FCBEBC0.90200@HIDDEN>
	<4FCC41AA.3040306@HIDDEN>
In-Reply-To: <4FCC41AA.3040306@HIDDEN>
X-Enigmail-Version: 1.3.2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: -1.9 (-)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org

On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
> 
> 
> Pádraig Brady wrote:
>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>> Within in the past few years, use of ranges in RE's has become
>>> unreliable due to some locale changes sorting their native character
>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>
>>> There seems to be a problem in when a user has set their system to use
>>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>>> or others).
> ----
>     To clarify my above statement:
> 
> 
>    There seems to be a problem in when a user has set their system to use
> Unicode: It is no longer using the locale specific character set (iso-8859-x,
> or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
> order -- I don't know that they others do ('C' does, but I don't know about
> other locale-specific character sets).
> 
> 
>> It's not specific to "unicode". Sorting in a iso-8859-1 charset
>> results in locale ordering:
> ----
>     Can you cite a source specifying the sort/collation order of the
> iso-8859-1 charset that would prove that it is not-conforming to the collation
> specification for that charset?
> 
>     I.e. If there is no official source, then the order with that charset
> is "undefined", and while it may not be desirable, returning a<A<b<B, would not
> be "an error".

It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=C sort | iconv -f iso-8859-1
A
a
b
á

> 
> 
> 
> 
>>> http://unicode.org/charts/case/chart_Latin.htm.
>>
>> http://unicode.org/charts/case/chart_Latin.html
> ---
>     ^^Correct^^ (typho)
> 
>>> Temporarily ignoring accents, only talking about lower and upper
>>> case letters, ...
>>
>> Well case comparison is a complicated area.
> ----
>     A bit, but it's mostly just wrong in the gnu library concerning unicode, and,
> as you are pointing out -- the 'C' encoding as well.
> the 'C' locale was the original charset used by the 'C' language -- only 8 bits
> wide.
> 
>     So how can it sort characters beyond the lower 256?
> This would seem to be meaningless and bugs output.

http://www.pixelbeat.org/docs/utf8_programming.html

> Is it?...   When the case comparison ordering is specified in a
> standard, it makes it fairly clear that one is either compliant with the standard
> or not.
> 
>     In this case, the Gnu sort/collation lib is not Unicode/UTF-8 compliant.
> 
>     What happens in other charsets may or may not be covered under some
> other standard -- e.g. the 'C'/ascii ordering is specified.  But I don't know
> if others have relevant standards or not.
> 
>>
>> For the special case of discounting accented chars etc.
>> you can use an attribute of the well designed UTF-8.
> ---
>     This is not exactly the point -- the point is that the core sort
> DOESN'T use that ordering.  That's the bug I am reporting.

Well you can't generally exclude accents.

> 
>     In reporting this, I'm trying to keep the argument 'simple' and focus on
> the problem of widely used ranges in the first 256 code-points of
> Unicode.
> 
>     Unicode gives a fairly extensive algorithm for handling accents,
> but I didn't want to complicate the discussion by "going there".  Please
> focus this bug on the lower 128 code points, as full unicode compliance
> with the full collation algorithm that is specified is likely to be a
> larger task.  HOWEVER, fixing the sorting/collation order of the lower
> 127 code points, is, comparatively a small task that conceivably could be
> fixed in the next release.

lower 127 = ASCII. If your input data is ASCII, just use LC_ALL=C.
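
A hedged sketch of how one might check the "input data is ASCII" precondition before relying on LC_ALL=C. The helper name is hypothetical and not part of coreutils; it simply deletes every byte in the ASCII range and checks whether anything is left.

```shell
# Hypothetical helper: succeed only if stdin contains no byte >= 0x80.
is_ascii() {
    # Delete every byte in the ASCII range (octal 000-177) and count
    # what is left; zero remaining bytes means the input was pure ASCII.
    _nonascii=$(LC_ALL=C tr -d '\000-\177' | wc -c)
    [ "$((_nonascii))" -eq 0 ]
}

# Sort byte-wise only when the data really is ASCII.
if printf '%s\n' A b a | is_ascii; then
    printf '%s\n' A b a | LC_ALL=C sort
fi
```

For input that fails the check, LC_ALL=C still sorts by byte value; the check only tells you whether that coincides with plain ASCII order.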

>> Enabling traditional byte comparison on (normalized) UTF-8 data
>> will result in data sorted in Unicode code point order:
>> A b a á => A a b á
> 
> But you are missing the point (as well as raising an interesting 'feature'(?bug?)).
> 
> How is it that 'C' collation collates characters that are outside the ascii range?

Well whether C should be a "unicode" or "ascii" charset is a whole different
kettle of fish. I was just referring (as per the link above), that
UTF8 is well designed so that it works with many traditional single byte functions.

> I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
> So how does this work in the 'C' local?  AND more importantly -- it SHOULD work
> when charset is unicode (UTF-8)... and does not.  Test prog:
> ---------------
> #!/bin/bash
> set -m
> # vals to test:
> declare -a vals=( A a B b X x Y y Z z Ⅷ  Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ  ⅼ ⅲ )
> COLLATE_ORDER=C
> 
> function isatty {
>     local fd=${1:-1} ;
>     0<&$fd tty -s
> }
> 
> function ord {
>   local nl="";
>     isatty && nl="\n"
>     printf "%d$nl" "'$1"
> }
> 
> function background_print {
>     readarray -t inp
>     for ch in "${inp[@]}"; {
>         printf "%s   (U+%x)\n" "$ch" "$(ord "$ch")"
>     }
> }
> 
> 
> printf "%s\n" "${vals[@]}" |
>         LC_COLLATE=$COLLATE_ORDER sort |
>         background_print
> 
> ------------------------------------
> 
> Note, that the above produces:
> 
> /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> NOT the output you showed...Seems there's a bug in the C collation order?

Note C doesn't use a collation order, it's simple byte comparison.
Seems there may be a bug in your script?
Also ensure that LC_ALL is not set, which will override LC_COLLATE.

$ printf "%s\n" A a B b 2 1 Ⅷ  ⅶ ⅲ | LC_COLLATE=C sort
1
2
A
B
a
b
Ⅷ
ⅲ
ⅶ
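
The override rule mentioned above can be sketched as a small lookup, assuming the usual POSIX precedence for collation: LC_ALL first, then LC_COLLATE, then LANG, with empty values treated as unset. The function name is hypothetical, and the final "C" fallback is an assumption about the default rather than something every system guarantees.

```shell
# Hypothetical helper: print the locale name that governs collation.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"        # LC_ALL overrides everything
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"    # category-specific setting
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"          # general default
    else
        printf '%s\n' "C"              # assumed default when nothing is set
    fi
}

effective_collate
```

If this prints something other than the value given to LC_COLLATE, a test script that sets only LC_COLLATE is not sorting the way it appears to.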

> 
> Changing collation order to UTF-8:
> 
> Same thing:
>  /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> 
>>> I would assert this is a serious bug that should be addressed ASAP...
>>
>> As for the question in the subject for handling ranges in REs,
>> there has been recent work in changing as you suggest:
>>
>> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
> ----
> 
>     Recent?

?

> The most recent posts on that thread look to be from June of last year.
> I.e. a year ago.
> 
> I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
> the gnu library doesn't follow it.
> 
> Major problem with so many progs relying on the lib!...

cheers,
Pádraig.

Message sent to bug-coreutils@HIDDEN:

X-Loop: help-debbugs@HIDDEN
Subject: bug#11621: questionable locale sorting order (especially as related	to char ranges in REs)
Resent-From: "Linda A. Walsh" <lkml@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Thu, 07 Jun 2012 01:19:01 +0000
Resent-Message-ID: <handler.11621.B11621.133903190124565 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 11621
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: 
To: 11621 <at> debbugs.gnu.org, =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
Received: via spool by 11621-submit <at> debbugs.gnu.org id=B11621.133903190124565
          (code B ref 11621); Thu, 07 Jun 2012 01:19:01 +0000
Received: (at 11621) by debbugs.gnu.org; 7 Jun 2012 01:18:21 +0000
Received: from localhost ([127.0.0.1]:60976 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ScRMi-0006O9-U0
	for submit <at> debbugs.gnu.org; Wed, 06 Jun 2012 21:18:21 -0400
Received: from ishtar.tlinx.org ([173.164.175.65]:38176
	helo=Ishtar.sc.tlinx.org) by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <lkml@HIDDEN>) id 1ScRMf-0006O1-Ox
	for 11621 <at> debbugs.gnu.org; Wed, 06 Jun 2012 21:18:19 -0400
Received: from [192.168.3.12] (Athenae [192.168.3.12])
	by Ishtar.sc.tlinx.org (8.14.5/8.14.4/SuSE Linux 0.8) with ESMTP id
	q571G20e006154; Wed, 6 Jun 2012 18:16:05 -0700
Message-ID: <4FD000D2.1040207@HIDDEN>
Date: Wed, 06 Jun 2012 18:16:02 -0700
From: "Linda A. Walsh" <lkml@HIDDEN>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US;
	rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24
	Mnenhy/0.7.6.666
MIME-Version: 1.0
References: <4FCBE17F.40102@HIDDEN>
	<4FCBEBC0.90200@HIDDEN>	<4FCC41AA.3040306@HIDDEN>
	<4FCC7674.1090705@HIDDEN>
In-Reply-To: <4FCC7674.1090705@HIDDEN>
Content-Type: multipart/alternative;
	boundary="------------010006010006080008090301"
X-Spam-Score: -1.9 (-)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org

This is a multi-part message in MIME format.
--------------010006010006080008090301
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by Ishtar.sc.tlinx.org id q571G20e006154

Pádraig Brady wrote:
> On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>
>> Pádraig Brady wrote:
>>
>>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>>
>>>> Within in the past few years, use of ranges in RE's has become
>>>> unreliable due to some locale changes sorting their native character
>>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>>
>>>> There seems to be a problem in when a user has set their system to use
>>>> Unicode, it is no longer using the locale specific character set (iso-8859-x,
>>>> or others).
>>>>
>> ----
>>     To clarify my above statement:
>>
>>
>>    There seems to be a problem in when a user has set their system to use
>> Unicode: It is no longer using the locale specific character set (iso-8859-x,
>> or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
>> order -- I don't know that they others do ('C' does, but I don't know about
>> other locale-specific character sets).
>>
>>
>>
>>> It's not specific to "unicode". Sorting in a iso-8859-1 charset
>>> results in locale ordering:
>>>
>> ----
>>     Can you cite a source specifying the sort/collation order of the
>> iso-8859-1 charset that would prove that it is not-conforming to the collation
>> specification for that charset?
>>
>>     I.e. If there is no official source, then the order with that charset
>> is "undefined", and while it may not be desirable, returning a<A<b<B, would not
>> be "an error".
>>
>
> It's a charset. Of course the order is defined. Try: man iso-8859-1
>
> The relative ordering can be trivially inferred from the command I presented.
> But to be explicit:
>
> $ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
> a
> A
> á
> b
>
----
Your example doesn't show the collation order of iso-8859-1.   You are
setting it to 'en_US' (as LC_ALL overrides all other LC vars; LANG sets
the default, but individual settings in the LC variables can override it).

A corrected example:

$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset |
  LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort |
  iconv -f $Charset | tr "\n" " "; echo "")
A B a b á

(I used 'Charset' to hold the charset name, added parens, printed them
in the same orientation as input, and added a 2nd capital letter to make
upper/lower case ordering clear.)

    I might note how "trivial" it was to arrive at incorrect output.
People often think me a pain because I ask them to explain what they
perceive to be obvious.  Unfortunately, what is obvious to 1 person may
not be so to another.

    The 'á' is not ASCII (original charset for C locale, coming from
unix & C programming language -- a reason why POSIX renamed the 'C'
locale to the POSIX locale).

    However, as 'á' is in the 1st 256 chars (above the ASCII range), it
can still work if you remove the iconv stuff (and note, I have no other
locale vars set):

$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE

$ (Charset=ASCII; printf "%s\n" A B b a á |
  LC_CHARSET=$Charset LC_COLLATE=$Charset sort | tr "\n" " "; echo "")
A B a b á

    To bring this to completion -- most linux systems today use the UTF-8
character set.  It shows an *identical* collation order for the above
chars as the iso-8859-1 charset.

    It appears that the collating functions are confused by the notation
that has been adopted in many distributions...namely <locale>.charset.
In such a notation, where the charset has been explicitly specified, and
where the charset has explicit COLLATION and case folding rules (those
for Unicode are extensive and handle accents as well as other forms like
ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsßŚśŜŝŞşŠšˢ...etc.)

    Therefore, I would like to see the character set's collation and
folding rules used where they are officially specified (as in the case
of Unicode or POSIX).

    Are you the person responsible for the libicuXXX files?




--------------010006010006080008090301--

Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.