GNU bug report logs - #21916
sort -u drops unique lines with some locales
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21916 in the body.
You can then email your comments to 21916 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#21916; Package coreutils. (Sat, 14 Nov 2015 05:39:02 GMT)
Acknowledgement sent to Christoph Anton Mitterer <calestyo <at> scientia.net>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 14 Nov 2015 05:39:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hey.
(GNU coreutils 8.23)
Attached is a file that, when sort -u'ed in my locale, loses lines
which are in fact unique.
I've also attached the locale, since it's a custom-made one, but the
same seems to happen with "standard" locales as well, see e.g.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
Cheers,
Chris.
PS: Please keep me CCed, as I'm writing off list.
[test-file (text/plain, attachment)]
[test-file.unique-sorted (text/plain, attachment)]
[en_DE (text/plain, attachment)]
[smime.p7s (application/x-pkcs7-signature, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#21916; Package coreutils. (Sat, 14 Nov 2015 11:07:02 GMT)
Message #8 received at 21916 <at> debbugs.gnu.org (full text, mbox):
tag 21916 notabug
close 21916
stop
On 14/11/15 05:38, Christoph Anton Mitterer wrote:
> Hey.
>
> (GNU coreutils 8.23)
>
> Attached is a file that, when sort -u'ed in my locale, loses lines
> which are in fact unique.
>
> I've also attached the locale, since it's a custom-made one, but the
> same seems to happen with "standard" locales as well, see e.g.
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
>
> Cheers,
> Chris.
>
> PS: Please keep me CCed, as I'm writing off list.
Unfortunately the roman numeral code points compare equal:
$ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
sort->strcoll("\342\205\241", "\342\205\240") = 0
Ⅱ
Ⅰ
If you compare at the byte level you'll get appropriate grouping:
$ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
Ⅰ
Ⅱ
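To see how this comparison affects -u specifically, here is a minimal sketch. It assumes a UTF-8 locale whose collation data treats Ⅰ and Ⅱ as equal, as in the ltrace output above; the second count depends on the locale data installed on the system and may well be 2 elsewhere.

```shell
# Three input lines, two of them distinct at the byte level.
# With bytewise comparison (C locale), -u keeps both distinct lines.
printf '%s\n' Ⅱ Ⅰ Ⅱ | LC_ALL=C sort -u | wc -l

# Under the current locale, -u keeps one line per group that strcoll()
# reports as equal, so the count can be lower than above.
printf '%s\n' Ⅱ Ⅰ Ⅱ | sort -u | wc -l
```

If the two counts differ, -u has dropped lines that are not byte-identical.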
The same goes for other similar representations,
like full width forms of latin numbers:
$ printf '%s\n' ２ １ | ltrace -e strcoll sort
sort->strcoll("\357\274\222", "\357\274\221") = 0
２
１
That's a bit surprising, though maybe since only a limited
number of these representations are provided, it was
not thought appropriate to provide collation orders for them.
There are details on the unicode representation at:
https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
Where it says "[f]or most purposes, it is preferable to compose the Roman numerals
from sequences of the appropriate Latin letters"
For example you could mix ISO 8859-1 and ISO 8859-5 to get appropriate sorting:
$ printf '%s\n' I II III IV V VI VII VIII ІХ Х ХI ХII ХIII ХIV ХV ХVI ХVII ХVIII ХІХ | sort
I
II
III
IV
V
VI
VII
VIII
ІХ
Х
ХI
ХII
ХIII
ХIV
ХV
ХVI
ХVII
ХVIII
ХІХ
If only portions of the line were appropriate to treat in the C locale
(not really for your grouping case, but generally, for sorting for example),
then you'd need to consider transformations like enclosed, fullwidth, halfwidth -> ASCII,
which might be done with a separate utility, with number-specific transformations
like the above perhaps handled within the numfmt utility?
One thing we might do immediately is to have the sort --debug option
provide some indication when strcoll() and memcmp() differ in direction.
cheers,
Pádraig.
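The check suggested above for sort --debug can be approximated from the outside with existing tools. The following is a minimal sketch, not an existing coreutils feature; collation_loss is a hypothetical helper name. It compares how many lines survive sort -u under bytewise comparison with how many survive under the current locale.

```shell
# Hypothetical helper (not part of coreutils): print how many distinct
# lines `sort -u` drops under the current locale compared with bytewise
# comparison. 0 means collation and byte equality agree for this file.
collation_loss() {
    bytewise=$(LC_ALL=C sort -u -- "$1" | wc -l)
    collated=$(sort -u -- "$1" | wc -l)
    echo $(( $bytewise - $collated ))
}
```

For example, `collation_loss test-file` printing a non-zero number would indicate that the locale's collation treats some distinct lines as equal.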
Information forwarded to bug-coreutils <at> gnu.org: bug#21916; Package coreutils. (Sat, 14 Nov 2015 21:21:01 GMT)
Message #11 received at 21916 <at> debbugs.gnu.org (full text, mbox):
Hey Pádraig
On Sat, 2015-11-14 at 11:06 +0000, Pádraig Brady wrote:
> Unfortunately the roman numeral code points compare equal:
>
> $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
> sort->strcoll("\342\205\241", "\342\205\240") = 0
> Ⅱ
> Ⅰ
>
> If you compare at the byte level you'll get appropriate grouping:
>
> $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
> Ⅰ
> Ⅱ
>
> The same goes for other similar representations,
> like full width forms of latin numbers:
>
> $ printf '%s\n' ２ １ | ltrace -e strcoll sort
> sort->strcoll("\357\274\222", "\357\274\221") = 0
> ２
> １
So the bug's basically in the locales?
> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.
Really strange...
> One thing we might do immediately, is maybe with the sort --debug
> option,
> to provide some indication when strcoll() and memcmp() differ in
> direction.
Well, I think the main problem here is that -u then doesn't actually do
what most people would expect from it.
AFAIU, it removes any lines that *collation would consider as
duplicate* ... and not any lines which *actually are duplicates*.
God knows how many scripts and other stuff this already breaks... and I
wonder whether any other tools may be badly affected by that collation
stuff, too...
Imagine you do a cp -a ... or diff -qr and these would leave out any
such files they consider duplicates :-(
That could really result in data loss.
Actually that's how I stumbled over it... I made some lists with find,
of files which are then to be binary compared on a source and copy
filesystem... over the find result I once used just sort and once sort
-u and was quite shocked then.
If I had taken the sort -u sorted list, then I might have lost some
files to copy / compare.
The semantics of -u are IMHO even more problematic, as it (AFAIU) won't
happen with LANG=C.
But normally people wouldn't expect that different locales lead to
completely different behaviour, especially with respect to collation -
they would only expect that things are ordered differently.
Would it be possible for sort -u to print a warning on stderr
when such a case occurs, where -u drops lines that are considered
identical in terms of collation but aren't really identical?
Cheers,
Chris.
btw: Is that bug tracker accessible somewhere? I'd like to update
the Debian bug to mark it as forwarded to this one.
[smime.p7s (application/x-pkcs7-signature, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#21916; Package coreutils. (Sat, 14 Nov 2015 21:24:02 GMT)
Message #14 received at 21916 <at> debbugs.gnu.org (full text, mbox):
Oh, some further possible solutions:
- document more clearly in the manpage and --help what -u really does,
  and especially that it may not behave as expected with other
  locales/collations.
  Perhaps even give an example, so that people understand the
  seriousness of that.
- add a companion option, maybe -U, which removes *only* those lines
  which are really binary identical between the two \n .
Cheers,
Chris.
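The behaviour proposed for a -U option can be approximated with existing tools. This is a minimal sketch under that reading of the proposal; sort_bytewise_unique is a hypothetical name, and -U itself does not exist in sort. It first removes only byte-identical lines in the C locale, then re-sorts the survivors in the current locale's collation order.

```shell
# Hypothetical stand-in for the proposed "-U" (no such sort option exists):
# 1. drop only byte-identical duplicates, using bytewise comparison;
# 2. re-sort what is left in the current locale's collation order.
sort_bytewise_unique() {
    LC_ALL=C sort -u "$@" | sort
}

# Both numerals survive, even if the locale collates them as equal.
printf '%s\n' Ⅱ Ⅰ Ⅱ | sort_bytewise_unique
```

The second sort runs without -u, so it only reorders lines and never discards any.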
[smime.p7s (application/x-pkcs7-signature, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#21916; Package coreutils. (Mon, 16 Nov 2015 00:12:02 GMT)
Message #17 received at 21916 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> Christoph Anton Mitterer wrote:
> > Attached is a file that, when sort -u'ed in my locale, loses lines
> > which are in fact unique.
> >
> > I've also attached the locale, since it's a custom-made one, but the
> > same seems to happen with "standard" locales as well, see e.g.
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
> >
> > PS: Please keep me CCed, as I'm writing off list.
>
> If you compare at the byte level you'll get appropriate grouping:
>
> $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
> Ⅰ
> Ⅱ
It is also possible to set only LC_COLLATE=C and not set everything to C.
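As a hedged illustration of that suggestion: LC_ALL, when set to a non-empty value, takes precedence over LC_COLLATE, so the sketch below clears it for the one command. The other categories (LC_CTYPE, LC_MESSAGES and so on) keep whatever values the environment already provides.

```shell
# Use bytewise collation only; leave the other locale categories alone.
# LC_ALL= (empty) counts as unset, so LC_COLLATE=C actually takes effect.
printf '%s\n' Ⅱ Ⅰ Ⅱ | LC_ALL= LC_COLLATE=C sort -u
```

With collation forced to C, -u removes only the byte-identical duplicate and keeps both numerals.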
> The same goes for other similar representations,
> like full width forms of latin numbers:
>
> $ printf '%s\n' ２ １ | ltrace -e strcoll sort
> sort->strcoll("\357\274\222", "\357\274\221") = 0
> ２
> １
>
> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.
Hmm... Seems questionable to me.
> There are details on the unicode representation at:
> https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
> Where it says "[f]or most purposes, it is preferable to compose the Roman numerals
> from sequences of the appropriate Latin letters"
>
> For example you could mix ISO 8859-1 and ISO 8859-5 to get appropriate sorting:
One can transliterate them using 'iconv'.
printf '%s\n' Ⅱ Ⅰ ２ １ | iconv -f UTF-8 -t ASCII//TRANSLIT | sort
1
2
I
II
Bob
Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:20:01 GMT)
bug closed, send any further explanations to 21916 <at> debbugs.gnu.org and Christoph Anton Mitterer <calestyo <at> scientia.net>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:20:01 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 22 Nov 2018 12:24:05 GMT)
This bug report was last modified 5 years and 179 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.