GNU bug report logs - #24601
UTF-8 locale makes lexicographic sort weird

Package: coreutils;

Reported by: mathew <meta <at> pobox.com>

Date: Mon, 3 Oct 2016 21:35:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Report forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Mon, 03 Oct 2016 21:35:02 GMT)

Acknowledgement sent to mathew <meta <at> pobox.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 03 Oct 2016 21:35:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: mathew <meta <at> pobox.com>
To: bug-coreutils <at> gnu.org
Subject: UTF-8 locale makes lexicographic sort weird
Date: Mon, 03 Oct 2016 19:54:02 +0000
coreutils-8.25 compiled from source on Fedora 24:

% echo "+00\n-0c\n+02\n-02" | src/sort
+00
-02
+02
-0c

This seems to be due to locale:

% echo "+00\n-0c\n+02\n-02" | LC_ALL=C src/sort
+00
+02
-02
-0c

% echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 src/sort
+00
-02
+02
-0c

Since OS X 10.11 still comes with coreutils 5.93, I tried that:

% echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

I've taken a look at the Unicode collation standard, and I can't
immediately see anything that explains the current (8.25) behavior.

I've also played around with
<http://demo.icu-project.org/icu-bin/locexp?_=en_US.UTF-8&d_=en&x=col> and I
can't come up with any set of Unicode collation options that gives the same
results.


mathew

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Mon, 03 Oct 2016 21:58:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Mon, 03 Oct 2016 21:58:03 GMT)

Notification sent to mathew <meta <at> pobox.com>:
bug acknowledged by developer. (Mon, 03 Oct 2016 21:58:03 GMT)

Message #12 received at 24601-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: mathew <meta <at> pobox.com>, 24601-done <at> debbugs.gnu.org
Subject: Re: bug#24601: UTF-8 locale makes lexicographic sort weird
Date: Mon, 3 Oct 2016 16:57:53 -0500
tag 24601 notabug
thanks

On 10/03/2016 02:54 PM, mathew wrote:
> coreutils-8.25 compiled from source on Fedora 24:
> 
> % echo "+00\n-0c\n+02\n-02" | src/sort

Not all 'echo' programs understand \n as an escape sequence; you are
better off using the portable printf(1) when trying to demonstrate
simple programs, as in:

$ printf '+00\n-0c\n+02\n-02' | sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
+00
___
-02
___
+02
___
-0c
___
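
A minimal sketch of the portable form suggested above, assuming only a POSIX sh, printf(1) and sort(1). The function names sample_lines and sort_sample are made up for this illustration and are not part of coreutils. It feeds the four sample lines to sort under an explicitly chosen locale, so the result does not depend on how the local echo treats \n:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of coreutils.

# Emit the four sample lines from the report, one per line.
# printf '%s\n' repeats the format for every argument, so no
# backslash escapes need to be interpreted at all.
sample_lines() {
    printf '%s\n' '+00' '-0c' '+02' '-02'
}

# Sort the sample under the locale named in $1 (for example C or
# en_US.UTF-8). LC_ALL overrides LC_COLLATE and LANG for this one command.
sort_sample() {
    sample_lines | LC_ALL="$1" sort
}

# Byte-order result, which does not depend on any locale definition.
sort_sample C
```

With C as the argument the lines come out in byte order (+00, +02, -02, -0c). Any other locale name gives whatever that platform's locale definition dictates.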

> 
> This seems to be due to locale:

Indeed, it is entirely due to locale, and hence is not a bug but rather
POSIX-mandated behavior that sort honors your locale rules.

> Since OS X 10.11 still comes with coreutils 5.93, I tried that:
> 
> % echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
> +00
> +02
> -02
> -0c

The sad thing is that POSIX says that locale authors (for all but the C
locale) have absolute control over all sorts of fiddly aspects of how
strcoll() behaves, and that just because two vendors declare that their
locale is named en_US.UTF-8 does NOT require those two vendors to have
the same locale definition.  So the collation rules between two
different platforms are very likely different, based on whoever wrote
the locale file for that platform, and what bug fixes have been
incorporated into the locale definition over time.

It appears that you are complaining that between your two systems, one
sorts the line '-02' before '+02' (even though it was specified later);
while the other system leaves the two lines unchanged with '+02' first.
If strcoll("-02", "+02") says the two strings collate identically, then
the two lines should have a final tie-breaker based on byte values
(which would put +02 first by byte values); but if the locale has
secondary (or even tertiary) sorting rules that put '-' before '+' (even
after the primary pass ignores punctuation and focuses only on
alphanumerics), then there is no chance for the tiebreaker rule to kick in.
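
The last-resort comparison can be observed in isolation. The sketch below is an illustration under stated assumptions rather than a reproduction of the strcoll() case: it forces a tie with a sort key instead of with locale rules, pins the C locale so the fallback is a plain byte comparison, and uses the -s (--stable) option of GNU sort, which disables the last-resort comparison. The sample data and the function names are invented for the example.

```shell
#!/bin/sh
# Two lines whose first field ties; they differ only in the second field.
tied_lines() {
    printf '%s\n' 'a 2' 'a 1'
}

# The key comparison ties on field 1, so sort falls back to comparing
# the whole lines byte by byte: 'a 1' comes out first.
with_last_resort() {
    tied_lines | LC_ALL=C sort -k1,1
}

# -s disables that fallback, so tied lines keep their input order.
without_last_resort() {
    tied_lines | LC_ALL=C sort -s -k1,1
}

with_last_resort
without_last_resort
```

In a locale where strcoll() already reports a difference between two lines, as it does for '-02' and '+02' under glibc's en_US.UTF-8, this fallback is never reached.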

Sadly, even the 'sort --debug' option is not able to easily demonstrate
the subtleties that go into the strcoll() function's rules for obeying
the locale sorting specification.

> 
> I've taken a look at the Unicode collation standard, and I can't
> immediately see anything that explains the current (8.25) behavior.

Locale rules are not required to follow Unicode collation rules, at
least not by POSIX.  It would be nice if all locales were synonymous
across platforms and behaved equivalently to Unicode rules, and in fact
glibc locale authors try hard to obey Unicode when writing locales, but
reading the Unicode collation standard will not tell you how a
particular locale will behave; only reading that locale's definition
will tell you what it will do.

At any rate, I don't see any bug in coreutils proper.  Perhaps you have
uncovered a problem in glibc's locale definition, either a change in
behavior over time or a difference from what you think it should do (you
didn't even state what you were EXPECTING to see, only that the output
differed from your expectations); but if so, that is better reported to
the glibc list.  In the meantime, I'm closing this as not a coreutils bug,
although you can feel free to continue the conversation.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Tue, 04 Oct 2016 15:10:02 GMT)

Message #15 received at 24601 <at> debbugs.gnu.org:

From: mathew <meta <at> pobox.com>
To: 24601 <at> debbugs.gnu.org
Subject: Update
Date: Tue, 04 Oct 2016 14:12:18 +0000
I reported the issue to the glibc maintainers, but they seem skeptical.
<https://sourceware.org/bugzilla/show_bug.cgi?id=20664>

Information forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Tue, 04 Oct 2016 15:28:01 GMT)

Message #18 received at 24601 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: mathew <meta <at> pobox.com>, 24601 <at> debbugs.gnu.org
Subject: Re: bug#24601: Update
Date: Tue, 4 Oct 2016 08:27:35 -0700
On 10/04/2016 07:12 AM, mathew wrote:
> I reported the issue to the glibc maintainers, but they seem skeptical.
> <https://sourceware.org/bugzilla/show_bug.cgi?id=20664>

They don't sound that skeptical to me. They just want a test case that 
involves glibc strcoll only. Although such a test case should be easy to 
write they're overworked and busy, and if neither they nor you have time 
to construct such a test case, the problem is evidently lower priority.





Information forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Tue, 04 Oct 2016 17:08:02 GMT)

Message #21 received at 24601 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, mathew <meta <at> pobox.com>,
 24601 <at> debbugs.gnu.org
Subject: Re: bug#24601: Update
Date: Tue, 4 Oct 2016 13:06:55 -0400
Hello,

On 10/04/2016 11:27 AM, Paul Eggert wrote:
> On 10/04/2016 07:12 AM, mathew wrote:
>> I reported the issue to the glibc maintainers, but they seem
>> skeptical. <https://sourceware.org/bugzilla/show_bug.cgi?id=20664>
>
> They don't sound that skeptical to me. They just want a test case
> that involves glibc strcoll only. Although such a test case should be
> easy to write they're overworked and busy, and if neither they nor
> you have time to construct such a test case, the problem is evidently
> lower priority.

This is also relevant for some of the multibyte code I'm writing,
and also hints back to Karl's issues in
   https://debbugs.gnu.org/23677

I'll send a test program that can be used to explore this further
(and will work on multiple systems for comparison) - but only much later tonight.

- assaf




Information forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Tue, 04 Oct 2016 21:24:02 GMT)

Message #24 received at 24601 <at> debbugs.gnu.org:

From: mathew <meta <at> pobox.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 24601 <at> debbugs.gnu.org
Subject: Re: bug#24601: Update
Date: Tue, 04 Oct 2016 16:24:40 +0000
I've given them a test case in plain C.


mathew

On Tue, Oct 4, 2016 at 10:27 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 10/04/2016 07:12 AM, mathew wrote:
> > I reported the issue to the glibc maintainers, but they seem skeptical.
> > <https://sourceware.org/bugzilla/show_bug.cgi?id=20664>
>
> They don't sound that skeptical to me. They just want a test case that
> involves glibc strcoll only. Although such a test case should be easy to
> write they're overworked and busy, and if neither they nor you have time
> to construct such a test case, the problem is evidently lower priority.
>
>

Information forwarded to bug-coreutils <at> gnu.org:
bug#24601; Package coreutils. (Wed, 05 Oct 2016 03:30:02 GMT)

Message #27 received at 24601 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: mathew <meta <at> pobox.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 24601 <at> debbugs.gnu.org
Subject: Re: bug#24601: Update
Date: Tue, 4 Oct 2016 23:29:41 -0400
Hello,

Attached is my test program for strcoll.
It is slightly big, but mostly because of additional help and debug printing.

This will also become relevant when we deal with multibyte sort/join/uniq later on.

From brief testing, it seems glibc with UTF-8 locales is the only libc that has special collation order for non-letters/punctuation characters.
As Andreas Schwab explained:
  "They are not ignored, just considered only secondary, if the first order
   characters didn't provide an ordering.".
  http://lists.gnu.org/archive/html/bug-coreutils/2016-06/msg00005.html
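
As a rough mental model of that two-level behavior, the sketch below decorates each line with a primary key (the line with '+' and '-' removed) and a secondary key (the punctuation alone), sorts on both, and strips the decoration again. This is a toy model and not how glibc implements strcoll(): the only punctuation handled is '+' and '-', and ranking '-' ahead of '+' is an assumption read off the glibc results reported in this thread, not taken from the locale source. The function name two_level_sort is hypothetical.

```shell
#!/bin/sh
# Toy two-level collation for lines containing '+' or '-'.
# Assumes the input lines contain no tab characters.
two_level_sort() {
    tab=$(printf '\t')
    while IFS= read -r line; do
        # Level 1: everything except the punctuation.
        primary=$(printf '%s\n' "$line" | sed 's/[+-]//g')
        # Level 2: punctuation only, with '-' mapped to 1 and '+' to 2
        # so that '-' ranks first (an assumption, see the text above).
        secondary=$(printf '%s\n' "$line" | sed 's/[^+-]//g; y/-+/12/')
        printf '%s\t%s\t%s\n' "$primary" "$secondary" "$line"
    done |
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2 |
        cut -f3
}

printf '%s\n' '+00' '-0c' '+02' '-02' | two_level_sort
```

For the four sample lines this prints +00, -02, +02, -0c, the same order sort produced under en_US.UTF-8 with glibc: the primary keys 00, 02, 02, 0c decide everything except the tie between the two "02" lines, which the secondary key then resolves.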

Interestingly, "ja_JP.UTF-8" is the only locale in glibc that uses a different order than all other UTF-8 locales, and its ordering is more "intuitive" (closer to ASCII).

Testing results from various systems are below.

The program is also available for download here:
  wget http://files.housegordon.org/src/strcoll-test.c
Compilation is trivial:
  cc -o strcoll-test strcoll-test.c

Usage is:
    Usage: ./strcoll-test [-svl] [[-KM] | TEXT1 TEXT2 TEXTn...] 
    
    Sorts TEXT1 TEXT2 TEXTn... according to the currently
    active locale using strcoll(3) call.
    If no parameters are specified, assumes '-M' instead.
    
    Options:
     -s   print result of each strcoll(3) call
     -v   print uname/glibc version information (if available)
     -l   print active local name
    
     -K   use input from https://debbugs.gnu.org/23677
     -M   use input from https://debbugs.gnu.org/24601 (default)
    
    Use LC_ALL to set locale.
    
    Examples:
    
      $ ./strcoll-test -ls '!a' '$z' '#c'
      active locale: en_US.UTF-8
      strcoll('$z', '#c') = 23
      strcoll('!a', '#c') = -2
      !a
      #c
      $z
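
Where compiling the C program is inconvenient, the sign of a comparison can be approximated from the shell by asking sort whether two lines are already in order. This is a sketch, not a substitute for calling strcoll(3) directly: it goes through sort's own comparison (including its byte-wise last resort), and the helper name in_order is invented here. It relies only on the POSIX -c option of sort, which exits 0 when the input is sorted and non-zero otherwise.

```shell
#!/bin/sh
# in_order LOCALE A B
# Succeeds if line A sorts at or before line B under LOCALE.
in_order() {
    printf '%s\n' "$2" "$3" | LC_ALL="$1" sort -c 2>/dev/null
}

# In the C locale '+' (0x2B) precedes '-' (0x2D).
if in_order C '+02' '-02'; then
    echo "C: +02 sorts before -02"
fi
```

Passing en_US.UTF-8 instead of C on a glibc system would be expected, going by the results reported in this thread, to give the opposite answer for that pair, provided the locale is actually installed.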


Comments welcomed,
 - assaf


[strcoll-test.c (application/octet-stream, attachment)]

====

### Ubuntu 14.04 with glibc 2.19

$ ./strcoll-test -svl -M                                                                       
Linux 3.13.0-88-generic glibc 2.19 stable
active locale: en_US.UTF-8
strcoll('+00', '-0c') = -12
strcoll('+02', '-02') = 62
strcoll('+00', '-02') = -2
strcoll('-0c', '-02') = 10
strcoll('-0c', '+02') = 10
+00
-02
+02
-0c

### Fedora 24

$ ./strcoll-test -svl -M
Linux 4.6.3-300.fc24.x86_64 glibc 2.23 stable
active locale: en_US.UTF-8
strcoll('+00', '-0c') = -12
strcoll('+02', '-02') = 62
strcoll('+00', '-02') = -2
strcoll('-0c', '-02') = 10
strcoll('-0c', '+02') = 10
+00
-02
+02
-0c


### Glibc with locale ja_JP.UTF-8 - not the same collation as other UTF-8 locales

$ LC_ALL=ja_JP.UTF-8 ./strcoll-test-glibc -svl -M
Linux 3.13.0-88-generic glibc 2.19 stable
active locale: ja_JP.UTF-8
strcoll('+00', '-0c') = -2
strcoll('+02', '-02') = -2
strcoll('+00', '+02') = -2
strcoll('-0c', '+02') = 2
strcoll('-0c', '-02') = 49
+00
+02
-02
-0c


### Musl Libc 1.15 on Ubuntu 14.04

$ ./strcoll-test-musl -svl -M                                                                        
Linux 3.13.0-88-generic not-glibc
active locale: en_US.UTF-8;en_US.UTF-8;en_US.UTF-8;en_US.UTF-8;en_US.UTF-8;en_US.UTF-8
strcoll('+02', '+00') = 2
strcoll('+02', '-0c') = -2
strcoll('+00', '-0c') = -2
strcoll('-0c', '-02') = 49
strcoll('-02', '+00') = 2
strcoll('-02', '+02') = 2
strcoll('+00', '+02') = -2
+00
+02
-02
-0c


### OpenBSD 6.0

$ LC_ALL=en_US.UTF-8 ./strcoll-test -svl -M
OpenBSD 6.0 not-glibc
active locale: C/en_US.UTF-8/C/C/C/en_US.UTF-8
strcoll('+00', '-0c') = -2
strcoll('-0c', '+02') = 2
strcoll('+00', '+02') = -2
strcoll('-0c', '-02') = 49
strcoll('+02', '-02') = -2
+00
+02
-02
-0c


### FreeBSD 10.3

$ LC_ALL=en_US.UTF-8 ./strcoll-test -svl -M
FreeBSD 10.3-RELEASE not-glibc
active locale: en_US.UTF-8
strcoll('+00', '-0c') = -2
strcoll('-0c', '+02') = 2
strcoll('+00', '+02') = -2
strcoll('-0c', '-02') = 49
strcoll('+02', '-02') = -2
+00
+02
-02
-0c


### Mac OS X 10.10.4

$ ./strcoll-test -svl -M
Darwin 14.4.0 not-glibc
active locale: en_US.UTF-8
strcoll('+00', '-0c') = -2
strcoll('-0c', '+02') = 2
strcoll('+00', '+02') = -2
strcoll('-0c', '-02') = 49
strcoll('+02', '-02') = -2
+00
+02
-02
-0c



### OpenSolaris 5.11/x86 - strange exception where the collation order of '+' and '-' is reversed

$ ./strcoll-test -sl -M
active locale: en_US.UTF-8
strcoll('+00', '-0c') = 461
strcoll('+00', '+02') = -39
strcoll('+02', '-02') = 461
strcoll('+00', '-02') = 461
strcoll('-0c', '-02') = 219
-02
-0c
+00
+02







bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 02 Nov 2016 11:24:03 GMT)

This bug report was last modified 7 years and 171 days ago.
