GNU bug report logs - #21916
sort -u drops unique lines with some locales

Reported by: Christoph Anton Mitterer <calestyo <at> scientia.net>

Date: Sat, 14 Nov 2015 05:39:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21916 in the body.
You can then email your comments to 21916 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#21916; Package coreutils. (Sat, 14 Nov 2015 05:39:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Christoph Anton Mitterer <calestyo <at> scientia.net>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 14 Nov 2015 05:39:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.net>
To: bug-coreutils <at> gnu.org
Subject: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 06:38:11 +0100

Hey.

(GNU coreutils 8.23)

Attached is a file that, when sort -u'ed in my locale, loses lines
that are in fact unique.

I've also attached the locale, since it's a custom-made one, but the
same seems to happen with "standard" locales as well; see e.g.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489

Cheers,
Chris.

PS: Please keep me CCed, as I'm writing off list.

[test-file (text/plain, attachment)]

[test-file.unique-sorted (text/plain, attachment)]

[en_DE (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#21916; Package coreutils. (Sat, 14 Nov 2015 11:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 21916 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Christoph Anton Mitterer <calestyo <at> scientia.net>, 21916 <at> debbugs.gnu.org
Subject: Re: bug#21916: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 11:06:22 +0000

tag 21916 notabug
close 21916
stop

On 14/11/15 05:38, Christoph Anton Mitterer wrote:
> Hey.
> 
> (GNU coreutils 8.23)
> 
> Attached is a file that, when sort -u'ed in my locale, loses lines
> that are in fact unique.
> 
> I've also attached the locale, since it's a custom-made one, but the
> same seems to happen with "standard" locales as well; see e.g.
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
> 
> Cheers,
> Chris.
> 
> PS: Please keep me CCed, as I'm writing off list.

Unfortunately the roman numeral code points compare equal:

  $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
  sort->strcoll("\342\205\241", "\342\205\240") = 0
  Ⅱ
  Ⅰ

If you compare at the byte level you'll get appropriate grouping:

  $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
  Ⅰ
  Ⅱ

The same goes for other similar representations,
like fullwidth forms of Latin digits:

  $ printf '%s\n' ２ １ | ltrace -e strcoll sort
  sort->strcoll("\357\274\222", "\357\274\221") = 0
  ２
  １
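
Where ltrace is not available, the same question can be put to sort
itself. The following is only a sketch: collate_equal is a hypothetical
helper name, and it relies on GNU sort's documented behaviour that
"sort -c -u" checks for strictly ascending order, so it fails when two
adjacent lines compare equal.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the two arguments compare
# equal under the current locale's collation, as seen by sort.
collate_equal() {
    # If either ordering is strictly ascending, sort treats the lines
    # as different.
    if printf '%s\n' "$1" "$2" | sort -c -u 2>/dev/null; then
        return 1
    fi
    if printf '%s\n' "$2" "$1" | sort -c -u 2>/dev/null; then
        return 1
    fi
    # Neither ordering is strictly ascending: the lines collate equal.
    return 0
}
```

For example, `collate_equal Ⅱ Ⅰ && echo "equal in this locale"` prints
the message only in locales where the two code points collate equal;
the result depends on the locale data installed.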

That's a bit surprising, though maybe since only a limited
number of these representations are provided, it was
not thought appropriate to provide collation orders for them.

There are details on the Unicode representation at:
https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
which says "[f]or most purposes, it is preferable to compose the Roman numerals
from sequences of the appropriate Latin letters".

For example, you could mix Latin letters (as in ISO 8859-1) with look-alike
Cyrillic letters (as in ISO 8859-5) to get appropriate sorting:

$ printf '%s\n' I II III IV V VI VII VIII ІХ Х ХI ХII ХIII ХIV ХV ХVI ХVII ХVIII ХІХ | sort
I
II
III
IV
V
VI
VII
VIII
ІХ
Х
ХI
ХII
ХIII
ХIV
ХV
ХVI
ХVII
ХVIII
ХІХ

If only portions of the line were appropriate to treat in the C locale
(not really your grouping case, but sorting in general, for example),
then you'd need to consider transformations like enclosed, fullwidth, halfwidth -> ASCII.
That might be done with a separate utility, with number-specific transformations
like the above perhaps handled within the numfmt utility?

One thing we might do immediately is have the sort --debug option
give some indication when strcoll() and memcmp() differ in direction.

cheers,
Pádraig.

Information forwarded to bug-coreutils <at> gnu.org:
bug#21916; Package coreutils. (Sat, 14 Nov 2015 21:21:01 GMT) Full text and rfc822 format available.

Message #11 received at 21916 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.net>
To: Pádraig Brady <P <at> draigBrady.com>, 21916 <at> debbugs.gnu.org
Subject: Re: bug#21916: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 22:19:55 +0100

Hey Pádraig

On Sat, 2015-11-14 at 11:06 +0000, Pádraig Brady wrote:
> Unfortunately the roman numeral code points compare equal:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
>   sort->strcoll("\342\205\241", "\342\205\240") = 0
>   Ⅱ
>   Ⅰ
> 
> If you compare at the byte level you'll get appropriate grouping:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
>   Ⅰ
>   Ⅱ
> 
> The same goes for other similar representations,
> like full width forms of latin numbers:
> 
>   $ printf '%s\n' ２ １ | ltrace -e strcoll sort
>   sort->strcoll("\357\274\222", "\357\274\221") = 0
>   ２
>   １
So the bug's basically in the locales?

> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.
Really strange...

> One thing we might do immediately, is maybe with the sort --debug
> option,
> to provide some indication when strcoll() and memcmp() differ in
> direction.
Well, I think the main problem here is that -u then doesn't actually do
what most people would expect from it.
AFAIU, it removes any lines that *collation would consider
duplicates* ... rather than lines which *actually are duplicates*.

God knows how many scripts and other stuff this already breaks... and I
wonder whether any other tools may be badly affected by that collation
stuff, too...
Imagine you did a cp -a ... or diff -qr and these left out any
files they considered duplicates :-(
That could really result in data loss.

Actually, that's how I stumbled over it... I used find to make lists of
files which are then to be binary compared between a source and a copy
filesystem... I ran the find result once through plain sort and once
through sort -u, and was quite shocked by the difference.

If I had taken the sort -u sorted list, then I might have lost some
files to copy / compare.

The semantics of -u are IMHO even more problematic, as it (AFAIU) won't
happen with LANG=C.
But normally people wouldn't expect that different locales lead to
completely different behaviour, especially with respect to collation -
they would only expect that things are ordered differently.

Would it be possible for sort -u to print a warning on stderr whenever
-u drops lines that are considered identical in terms of collation but
aren't really identical?
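
Until sort offers something like that, such a check can be approximated
from outside. This is a minimal sketch, not an existing sort feature:
warn_if_collation_drops is a hypothetical name, and it only compares
how many lines survive "sort -u" under byte comparison with how many
survive under the current locale.

```shell
#!/bin/sh
# Hypothetical helper: warn when "sort -u" in the current locale would
# drop lines that are not byte-identical duplicates.
warn_if_collation_drops() {
    file=$1
    # Distinct lines when compared byte by byte.
    by_bytes=$(LC_ALL=C sort -u -- "$file" | wc -l | tr -d ' ')
    # Distinct lines according to the current locale's collation.
    by_locale=$(sort -u -- "$file" | wc -l | tr -d ' ')
    if [ "$by_bytes" -ne "$by_locale" ]; then
        echo "warning: sort -u drops $((by_bytes - by_locale)) line(s) that differ byte-wise" >&2
        return 1
    fi
    return 0
}
```

Calling `warn_if_collation_drops filelist` before trusting a
"sort -u"-ed list returns non-zero whenever the locale would merge
lines that are not byte-identical.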

Cheers,
Chris.

btw: Is this bug tracker accessible somewhere? I'd like to update
the Debian bug to mark it as forwarded to this one.

Information forwarded to bug-coreutils <at> gnu.org:
bug#21916; Package coreutils. (Sat, 14 Nov 2015 21:24:02 GMT) Full text and rfc822 format available.

Message #14 received at 21916 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.net>
To: Pádraig Brady <P <at> draigBrady.com>, 21916 <at> debbugs.gnu.org
Subject: Re: bug#21916: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 22:23:11 +0100

Oh, two further possible solutions:

- Document more clearly in the manpage and --help what -u really does,
and especially that it may not behave as expected with other
locales/collations.
Perhaps even give an example, so that people understand how serious
this is.

- Add a companion option, maybe -U, which drops *only* those lines
that are really byte-for-byte identical (everything between two \n).
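
No -U option exists in sort; the effect described can be approximated
with two passes. The sketch below (unique_by_bytes is a hypothetical
name) first drops only byte-identical lines, then orders the survivors
with the current locale. It assumes GNU sort, which without -u or -s
falls back to a byte comparison for lines that collate equal, so no
line is lost in the second pass.

```shell
#!/bin/sh
# Hypothetical helper: remove only byte-identical duplicate lines from
# stdin, then present the result in the current locale's order.
unique_by_bytes() {
    # Pass 1: deduplicate with byte comparison.
    # Pass 2: plain sort in the caller's locale; it never drops lines.
    LC_ALL=C sort -u | sort
}
```

Usage would be along the lines of `find . -type f | unique_by_bytes`,
with the usual caveat that file names containing newlines need
NUL-separated handling instead.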

Cheers,
Chris.

Information forwarded to bug-coreutils <at> gnu.org:
bug#21916; Package coreutils. (Mon, 16 Nov 2015 00:12:02 GMT) Full text and rfc822 format available.

Message #17 received at 21916 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: Christoph Anton Mitterer <calestyo <at> scientia.net>, 21916 <at> debbugs.gnu.org
Subject: Re: bug#21916: sort -u drops unique lines with some locales
Date: Sun, 15 Nov 2015 17:11:37 -0700

Pádraig Brady wrote:
> Christoph Anton Mitterer wrote:
> > Attached is a file that, when sort -u'ed in my locale, loses lines
> > which are in fact unique.
> > 
> > I've also attached the locale, since it's a custom-made one, but the
> > same seems to happen with "standard" locales as well; see e.g.
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
> > 
> > PS: Please keep me CCed, as I'm writing off list.
> 
> If you compare at the byte level you'll get appropriate grouping:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
>   Ⅰ
>   Ⅱ

It is also possible to set only LC_COLLATE=C and not set everything to C.
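
A small sketch of that variant follows; the function name is
hypothetical. One assumption is made explicit: a non-empty LC_ALL in
the environment overrides LC_COLLATE, so the sketch sets LC_ALL to the
empty string (treated as unset) for this one command. Other categories
such as LC_CTYPE and LC_MESSAGES keep their values.

```shell
#!/bin/sh
# Hypothetical helper: deduplicate stdin using byte-wise collation only,
# leaving the other locale categories untouched.
sort_u_c_collate() {
    # An empty LC_ALL is treated as unset, so LC_COLLATE=C takes effect.
    LC_ALL= LC_COLLATE=C sort -u
}

# Both Roman numeral code points survive; the duplicate Ⅱ is dropped.
printf '%s\n' Ⅱ Ⅰ Ⅱ | sort_u_c_collate
```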

> The same goes for other similar representations,
> like full width forms of latin numbers:
> 
>   $ printf '%s\n' ２ １ | ltrace -e strcoll sort
>   sort->strcoll("\357\274\222", "\357\274\221") = 0
>   ２
>   １
>
> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.

Hmm...  Seems questionable to me.

> There are details on the unicode representation at:
> https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
> Where it says "[f]or most purposes, it is preferable to compose the Roman numerals
> from sequences of the appropriate Latin letters"
> 
> For example you could mix ISO 8859-1 and ISO 8859-5 to get appropriate sorting:

One can transliterate them using 'iconv'.

  printf '%s\n' Ⅱ Ⅰ ２ １ | iconv -f UTF-8 -t ASCII//TRANSLIT | sort
  1
  2
  I
  II

Bob

Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:20:01 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 21916 <at> debbugs.gnu.org and Christoph Anton Mitterer <calestyo <at> scientia.net> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:20:01 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 22 Nov 2018 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 179 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
