GNU bug report logs - #40226
sort: expected sort order when -c in use
To reply to this bug, email your comments to 40226 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#40226; Package coreutils. (Wed, 25 Mar 2020 17:55:02 GMT)
Acknowledgement sent to Richard Ipsum <richardipsum <at> vx21.xyz>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 25 Mar 2020 17:55:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
I'm trying to understand something and thought it would be good to ask
here.
I get different results for a case-insensitive sort using -c. My
understanding is that -f should lead to lower case characters with upper
case equivalents being converted to their upper case equivalents. This
doesn't seem to be happening for the C locale though.
% echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
% echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
% echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
sort: -:2: disorder: AAAA
Is this considered a bug or an expected difference between the locales?
Thanks,
Richard
Information forwarded to bug-coreutils <at> gnu.org: bug#40226; Package coreutils. (Wed, 25 Mar 2020 18:18:02 GMT)
Message #8 received at 40226 <at> debbugs.gnu.org (full text, mbox):
On 3/25/20 12:37 PM, Richard Ipsum wrote:
> Hi,
>
> I'm trying to understand something and thought it would be good to ask
> here.
>
> I get different results for a case-insensitive sort using -c. My
> understanding is that -f should lead to lower case characters with upper
> case equivalents being converted to their upper case equivalents. This
> doesn't seem to be happening for the C locale though.
>
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
> sort: -:2: disorder: AAAA
First, 'echo -e' is not portable, so I'll be reproducing your example
with printf. And you are assuming that LC_ALL is not set (otherwise,
LC_COLLATE would have no impact); so I'll set LC_ALL to be sure. Except
that I can't reproduce your example (I'm using Fedora 31, coreutils 8.31):
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f -
sort: -:2: disorder: AAAA
So there's probably something different in the locale libraries and/or
your coreutils version on your system, compared to mine.
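The precedence rule mentioned above can be sketched as follows, assuming GNU sort and a POSIX shell. When LC_ALL is set it overrides LC_COLLATE, so the LC_COLLATE value passed below should have no effect and the check should behave as in the C locale. The helper name check_with is made up for illustration.

```shell
# Hypothetical helper: run the check from this thread with both LC_ALL and
# LC_COLLATE set, and return sort's exit status.
check_with() {
    # $1 = value for LC_ALL, $2 = value for LC_COLLATE
    printf 'aaaa\nAAAA\n' | LC_ALL=$1 LC_COLLATE=$2 sort -c -f - 2>/dev/null
}

# LC_ALL=C takes precedence over LC_COLLATE, so this matches the plain
# C-locale run from the original report.
if check_with C en_US.UTF-8; then
    echo "in order"
else
    echo "disorder"
fi
```

Because LC_ALL is C here, the outcome does not depend on which locales are installed.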
Next, let's debug things to see why:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f - --debug
sort: options '-c --debug' are incompatible
Oh, bummer - I don't know why we have that restriction. Okay, let's try
a slightly different approach:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
AAAA
____
____
aaaa
____
____
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug -s
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
aaaa
____
AAAA
____
See the difference? In the first case, sort is doing its default
case-insensitive comparison of the entire line (because you passed -f
but not -k), AND a stability comparison of the byte values of the entire
line (as shown by the two ____ lines per input). But in the second
case, when you add -s, the stability comparison is omitted. The two
lines are indeed different when the stability comparison is performed,
explaining why -c choked when -s is absent. Or put another way, -f
affects only -k, including the implied -k1 when you don't specify
anything, and not -s. So now that we know that, let's return to your
example:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - -c -s
$ echo $?
0
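The same effect can be reproduced in the C locale, where the result does not depend on installed locale data. This is a minimal sketch assuming GNU sort (-s is not in POSIX); the function name check_folded is made up for illustration.

```shell
# Hypothetical helper: check the two-line input from this thread with -f in
# the C locale, passing any extra options through to sort.
check_folded() {
    printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c -f "$@" - 2>/dev/null
}

# Without -s: the folded keys tie, the last-resort byte comparison puts
# "aaaa" after "AAAA", and sort -c reports disorder (non-zero exit).
if check_folded; then echo "without -s: in order"; else echo "without -s: disorder"; fi

# With -s: the last-resort comparison is skipped, so the tie is accepted.
if check_folded -s; then echo "with -s: in order"; else echo "with -s: disorder"; fi
```

The first line printed should report disorder and the second should report that the input is in order.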
>
> Is this considered a bug or an expected difference between the locales?
I don't know if it's the locale definition, or something changed between
coreutils versions, or both; although I'm more likely to chalk it up to
locale issues and not something where coreutils needs a patch, other
than perhaps a documentation patch. I'll leave the bug report itself
open for a bit longer, in case anyone else has an opinion.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#40226; Package coreutils. (Wed, 25 Mar 2020 21:17:02 GMT)
Message #11 received at 40226 <at> debbugs.gnu.org (full text, mbox):
On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
> On 3/25/20 12:37 PM, Richard Ipsum wrote:
[snip]
>
> See the difference? In the first case, sort is doing its default
> case-insensitive comparison of the entire line (because you passed -f but
> not -k), AND a stability comparison of the byte values of the entire line
> (as shown by the two ____ lines per input). But in the second case, when
> you add -s, the stability comparison is omitted. The two lines are indeed
> different when the stability comparison is performed, explaining why -c
> choked when -s is absent. Or put another way, -f affects only -k, including
> the implied -k1 when you don't specify anything, and not -s. So now that we
> know that, let's return to your example:
I'm trying to understand this relative to POSIX, which makes no mention
of stability as far as I can see (and there is no -s in POSIX). POSIX
says that -f should override the default ordering rules. I don't
understand why the last-resort comparison is required when -c is in use,
since we're not sorting with -c, just checking if the input is already sorted?
Put another way, should -c imply -s?
Thanks,
Richard
Information forwarded to bug-coreutils <at> gnu.org: bug#40226; Package coreutils. (Wed, 25 Mar 2020 21:36:02 GMT)
Message #14 received at 40226 <at> debbugs.gnu.org (full text, mbox):
On 3/25/20 3:02 PM, Richard Ipsum wrote:
> On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
>> On 3/25/20 12:37 PM, Richard Ipsum wrote:
> [snip]
>>
>> See the difference? In the first case, sort is doing its default
>> case-insensitive comparison of the entire line (because you passed -f but
>> not -k), AND a stability comparison of the byte values of the entire line
>> (as shown by the two ____ lines per input). But in the second case, when
>> you add -s, the stability comparison is omitted. The two lines are indeed
>> different when the stability comparison is performed, explaining why -c
>> choked when -s is absent. Or put another way, -f affects only -k, including
>> the implied -k1 when you don't specify anything, and not -s. So now that we
>> know that, let's return to your example:
>
> I'm trying to understand this relative to POSIX, which makes no mention
> of stability as far as I can see (and there is no -s in POSIX). POSIX
> says that -f should override the default ordering rules. I don't
> understand why the last-resort comparison is required when -c is in use,
> since we're not sorting with -c, just checking if the input is already sorted?
POSIX states [sort description]:
"If this collating sequence does not have a total ordering of all
characters (see XBD LC_COLLATE), any lines of input that collate equally
should be further compared byte-by-byte using the collating sequence for
the POSIX locale."
As I understand it, this is true even when -f modifies the collating
sequence to compare all lowercase characters as their uppercase equivalent.
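A small sketch of how that fallback shows up when actually sorting, assuming GNU sort and the C locale (chosen so the byte values decide the tie-break):

```shell
# With -f, "a"/"A" and "b"/"B" tie on the folded key. The last-resort
# byte comparison then orders each pair: 'A' (65) sorts before 'a' (97).
printf 'b\nB\na\nA\n' | LC_ALL=C sort -f
# Output: A, a, B, b (one per line)

# With -s (a GNU extension, not in POSIX) the last-resort comparison is
# skipped and tied lines keep their input order.
printf 'b\nB\na\nA\n' | LC_ALL=C sort -f -s
# Output: a, A, b, B (one per line)
```

So the extra comparison is what turns two "equal" lines into a strict order, which is also what -c trips over.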
But POSIX further states [XBD LC_COLLATE]:
"All implementation-provided locales (either preinstalled or provided as
locale definitions which can be installed later) should define a
collation sequence that has a total ordering of all characters unless
the locale name has an '@' modifier indicating that it has a special
collation sequence (for example, @icase could indicate that each upper
and lowercase character pair collates equally).
Notes:
A future version of this standard may require these locales to
define a collation sequence that has a total ordering of all characters
(by changing "should" to "shall").
Users installing their own locales should ensure that they
define a collation sequence with a total ordering of all characters
unless an '@' modifier in the locale name (such as @icase ) indicates
that it has a special collation sequence."
>
> Put another way, should -c imply -s?
Maybe we compromise, and state that -c implies -s only for locales that
do not include @ in their name (that is, if a locale already guarantees
a total ordering of all characters, then even when -f collapses
lowercase into uppercase, we don't need the final-resort comparison; but
if a locale does not guarantee total ordering, the -s has to be explicit)?
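To make the suggested compromise concrete, here is a rough sketch as a shell wrapper. It is purely illustrative: sort_check and effective_collate are hypothetical names, nothing like this exists in coreutils, and the locale lookup simply follows the usual LC_ALL, LC_COLLATE, LANG precedence.

```shell
# Hypothetical wrapper sketching the proposed compromise; not part of coreutils.

# Print the locale name that governs collation: LC_ALL, else LC_COLLATE,
# else LANG, else "C". Empty values count as unset.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Like 'sort -c', but skip the last-resort comparison (-s) unless the
# locale name carries an '@' modifier.
sort_check() {
    case $(effective_collate) in
        *@*) sort -c "$@" ;;
        *)   sort -c -s "$@" ;;
    esac
}

# The C locale has no '@' modifier, so -s is implied and the check passes.
if (export LC_ALL=C; printf 'aaaa\nAAAA\n' | sort_check -f -); then
    echo "in order"
else
    echo "disorder"
fi
```

With a locale name containing '@', the wrapper leaves the decision to the caller, who would have to pass -s explicitly.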
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
This bug report was last modified 4 years and 247 days ago.