GNU bug report logs - #40226
sort: expected sort order when -c in use

Package: coreutils;

Reported by: Richard Ipsum <richardipsum <at> vx21.xyz>

Date: Wed, 25 Mar 2020 17:55:02 UTC

Severity: normal

To reply to this bug, email your comments to 40226 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#40226; Package coreutils. (Wed, 25 Mar 2020 17:55:02 GMT)

Acknowledgement sent to Richard Ipsum <richardipsum <at> vx21.xyz>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 25 Mar 2020 17:55:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Richard Ipsum <richardipsum <at> vx21.xyz>
To: bug-coreutils <at> gnu.org
Subject: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 18:37:38 +0100

Hi,

I'm trying to understand something and thought it would be good to ask
here.

I get different results for a case-insensitive sort using -c. My
understanding is that -f should lead to lower case characters with upper
case equivalents being converted to their upper case equivalents. This
doesn't seem to be happening for the C locale though.

% echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
% echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
% echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
sort: -:2: disorder: AAAA

Is this considered a bug or an expected difference between the locales?

Thanks,
Richard




Information forwarded to bug-coreutils <at> gnu.org:
bug#40226; Package coreutils. (Wed, 25 Mar 2020 18:18:02 GMT)

Message #8 received at 40226 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Richard Ipsum <richardipsum <at> vx21.xyz>, 40226 <at> debbugs.gnu.org
Subject: Re: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 13:17:19 -0500

On 3/25/20 12:37 PM, Richard Ipsum wrote:
> Hi,
> 
> I'm trying to understand something and thought it would be good to ask
> here.
> 
> I get different results for a case-insensitive sort using -c. My
> understanding is that -f should lead to lower case characters with upper
> case equivalents being converted to their upper case equivalents. This
> doesn't seem to be happening for the C locale though.
> 
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
> sort: -:2: disorder: AAAA

First, 'echo -e' is not portable, so I'll be reproducing your example 
with printf.  And you are assuming that LC_ALL is not set (otherwise, 
LC_COLLATE would have no impact); so I'll set LC_ALL to be sure.  Except 
that I can't reproduce your example (I'm using Fedora 31, coreutils 8.31):

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f -
sort: -:2: disorder: AAAA

So there's probably something different in the locale libraries and/or 
your coreutils version on your system, compared to mine.
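
The LC_ALL remark above rests on the usual precedence among locale variables. The following is a minimal Python sketch of that rule, not code from coreutils or libc. The helper name `effective_collate_locale` is hypothetical, and the fallback to "C" when nothing is set is an assumption (POSIX leaves that default to the implementation).

```python
def effective_collate_locale(env):
    """Return the locale name that would govern collation for `env`.

    Hypothetical helper modelling the POSIX precedence:
    LC_ALL, then LC_COLLATE, then LANG. Empty values count as unset.
    The final fallback to "C" is an assumption of this sketch.
    """
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


# LC_COLLATE is ignored as soon as LC_ALL is set to a non-empty value.
print(effective_collate_locale({"LC_COLLATE": "C", "LANG": "en_GB.UTF-8"}))
print(effective_collate_locale({"LC_ALL": "en_US.UTF-8", "LC_COLLATE": "C"}))
```

Under this model, the original report's `LC_COLLATE=C` only takes effect if LC_ALL is unset or empty in the reporter's environment.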

Next, let's debug things to see why:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f - --debug
sort: options '-c --debug' are incompatible

Oh, bummer - I don't know why we have that restriction.  Okay, let's try 
a slightly different approach:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
AAAA
____
____
aaaa
____
____
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug -s
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
aaaa
____
AAAA
____

See the difference?  In the first case, sort is doing its default 
case-insensitive comparison of the entire line (because you passed -f 
but not -k), AND a stability comparison of the byte values of the entire 
line (as shown by the two ____ lines per input).  But in the second 
case, when you add -s, the stability comparison is omitted.  The two 
lines are indeed different when the stability comparison is performed, 
explaining why -c choked when -s is absent.  Or put another way, -f 
affects only -k, including the implied -k1 when you don't specify 
anything, and not -s.  So now that we know that, let's return to your 
example:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - -c -s
$ echo $?
0
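
The two-stage comparison described above can be modelled in a few lines. This is a minimal Python sketch under stated assumptions, not the coreutils implementation: the names `compare_lines` and `first_disorder` are hypothetical, and both stages use plain byte order, so the model only approximates the C locale and does not reproduce locale-aware collation.

```python
def compare_lines(a, b, fold=False, stable=False):
    """Compare two lines (bytes); return a negative, zero, or positive int.

    Stage 1: compare the keys, folding ASCII lowercase to uppercase if `fold`
             is set (a stand-in for -f).
    Stage 2: if the keys tie and `stable` is not set (a stand-in for the
             absence of -s), fall back to comparing the raw bytes.
    """
    key_a, key_b = (a.upper(), b.upper()) if fold else (a, b)
    if key_a != key_b:
        return -1 if key_a < key_b else 1
    if stable:
        return 0  # keys tie and the last-resort comparison is disabled
    return (a > b) - (a < b)  # last-resort comparison on the unmodified lines


def first_disorder(lines, fold=False, stable=False):
    """Return the 1-based number of the first out-of-order line, or None."""
    for i in range(1, len(lines)):
        if compare_lines(lines[i - 1], lines[i], fold, stable) > 0:
            return i + 1
    return None


lines = [b"aaaa", b"AAAA"]
print(first_disorder(lines, fold=True))               # 2
print(first_disorder(lines, fold=True, stable=True))  # None
```

In this model the folded keys tie, so the verdict depends entirely on whether the last-resort comparison runs: without the `stable` flag, line 2 is reported as out of order, which mirrors the `-:2: disorder: AAAA` message seen with `-c -f`.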


> 
> Is this considered a bug or an expected difference between the locales?

I don't know if it's the locale definition, or something changed between 
coreutils versions, or both; although I'm more likely to chalk it up to 
locale issues and not something where coreutils needs a patch, other 
than perhaps a documentation patch.  I'll leave the bug report itself 
open for a bit longer, in case anyone else has an opinion.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Information forwarded to bug-coreutils <at> gnu.org:
bug#40226; Package coreutils. (Wed, 25 Mar 2020 21:17:02 GMT)

Message #11 received at 40226 <at> debbugs.gnu.org:

From: Richard Ipsum <richardipsum <at> vx21.xyz>
To: Eric Blake <eblake <at> redhat.com>
Cc: 40226 <at> debbugs.gnu.org
Subject: Re: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 21:02:32 +0100

On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
> On 3/25/20 12:37 PM, Richard Ipsum wrote:
[snip]
> 
> See the difference?  In the first case, sort is doing its default
> case-insensitive comparison of the entire line (because you passed -f but
> not -k), AND a stability comparison of the byte values of the entire line
> (as shown by the two ____ lines per input).  But in the second case, when
> you add -s, the stability comparison is omitted.  The two lines are indeed
> different when the stability comparison is performed, explaining why -c
> choked when -s is absent.  Or put another way, -f affects only -k, including
> the implied -k1 when you don't specify anything, and not -s.  So now that we
> know that, let's return to your example:

I'm trying to understand this relative to POSIX, which makes no mention
of stability as far as I can see (and there is no -s in POSIX). POSIX
says that -f should override the default ordering rules. I don't
understand why the last-resort comparison is required when -c is in use,
since we're not sorting with -c, just checking if the input is already sorted?

Put another way should -c imply -s ?

Thanks,
Richard




Information forwarded to bug-coreutils <at> gnu.org:
bug#40226; Package coreutils. (Wed, 25 Mar 2020 21:36:02 GMT)

Message #14 received at 40226 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Richard Ipsum <richardipsum <at> vx21.xyz>
Cc: 40226 <at> debbugs.gnu.org
Subject: Re: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 16:35:47 -0500

On 3/25/20 3:02 PM, Richard Ipsum wrote:
> On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
>> On 3/25/20 12:37 PM, Richard Ipsum wrote:
> [snip]
>>
>> See the difference?  In the first case, sort is doing its default
>> case-insensitive comparison of the entire line (because you passed -f but
>> not -k), AND a stability comparison of the byte values of the entire line
>> (as shown by the two ____ lines per input).  But in the second case, when
>> you add -s, the stability comparison is omitted.  The two lines are indeed
>> different when the stability comparison is performed, explaining why -c
>> choked when -s is absent.  Or put another way, -f affects only -k, including
>> the implied -k1 when you don't specify anything, and not -s.  So now that we
>> know that, let's return to your example:
> 
> I'm trying to understand this relative to POSIX, which makes no mention
> of stability as far as I can see (and there is no -s in POSIX). POSIX
> says that -f should override the default ordering rules. I don't
> understand why the last-resort comparison is required when -c is in use,
> since we're not sorting with -c, just checking if the input is already sorted?

POSIX states [sort description]:

"If this collating sequence does not have a total ordering of all 
characters (see XBD LC_COLLATE), any lines of input that collate equally 
should be further compared byte-by-byte using the collating sequence for 
the POSIX locale."

As I understand it, this is true even when -f modifies the collating 
sequence to compare all lowercase characters as their uppercase equivalent.

But POSIX further states [XBD LC_COLLATE]:

"All implementation-provided locales (either preinstalled or provided as 
locale definitions which can be installed later) should define a 
collation sequence that has a total ordering of all characters unless 
the locale name has an '@' modifier indicating that it has a special 
collation sequence (for example, @icase could indicate that each upper 
and lowercase character pair collates equally).

Notes:

        A future version of this standard may require these locales to 
define a collation sequence that has a total ordering of all characters 
(by changing "should" to "shall").

        Users installing their own locales should ensure that they 
define a collation sequence with a total ordering of all characters 
unless an '@' modifier in the locale name (such as @icase ) indicates 
that it has a special collation sequence."

> 
> Put another way should -c imply -s ?

Maybe we compromise, and state that -c implies -s only for locales that 
do not include @ in their name (that is, if a locale already guarantees 
a total ordering of all characters, then even when -f collapses 
lowercase into uppercase, we don't need the final-resort comparison; but 
if a locale does not guarantee total ordering, the -s has to be explicit)?
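
One possible reading of the compromise floated above, written out as a rule. This is purely illustrative Python: the function name `check_implies_stable` is hypothetical, nothing in this thread says the proposal was adopted, and the test on the locale name is only the heuristic suggested by the quoted POSIX text about '@' modifiers.

```python
def check_implies_stable(locale_name):
    """Decide whether -c would behave as if -s were given (proposal only).

    Heuristic from the discussion: a locale without an '@' modifier is
    expected to define a total ordering of all characters, so the
    last-resort comparison could be skipped when checking. A locale with
    an '@' modifier (for example "@icase") may collate distinct characters
    equally, so -s would have to be requested explicitly.
    """
    return "@" not in locale_name


print(check_implies_stable("en_GB.UTF-8"))        # True
print(check_implies_stable("en_US.UTF-8@icase"))  # False
```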

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





This bug report was last modified 4 years and 225 days ago.
