GNU bug report logs - #8871
Bug with "sort -i" ?


Package: coreutils

Reported by: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>

Date: Wed, 15 Jun 2011 16:04:02 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 8871 in the body.
You can then email your comments to 8871 AT debbugs.gnu.org in the normal way.



Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8871; Package coreutils. (Wed, 15 Jun 2011 16:04:02 GMT)

Acknowledgement sent to Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 15 Jun 2011 16:04:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>
To: bug-coreutils <at> gnu.org
Subject: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 17:42:51 +0200
Hi,

this looks like a bug to me:

var="φθινόπωρο,κισσός,Φύλλο"


echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
sort -f -u
κισσός
φθινόπωρο
φύλλο

echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
sort -f -i -u
φθινόπωρο

Al




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8871; Package coreutils. (Wed, 15 Jun 2011 20:10:03 GMT)

Message #8 received at 8871 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>
Cc: 8871 <at> debbugs.gnu.org
Subject: Re: bug#8871: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 14:08:49 -0600
retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
> 
> this looks like a bug to me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

> 
> var="φθινόπωρο,κισσός,Φύλλο"
> 
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change comma into newline.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).  First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________


Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable characters)?
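
The difference in underline lengths can be checked by counting bytes directly. The following is a minimal sketch, assuming the three words are stored as precomposed UTF-8; `byte_len` is a hypothetical helper introduced only for this illustration.

```shell
# byte_len is a hypothetical helper, not part of coreutils.
# It prints how many bytes a string occupies (wc -c counts bytes).
byte_len() {
    # The arithmetic expansion strips any padding some wc versions add.
    echo $(( $(printf '%s' "$1" | LC_ALL=C wc -c) ))
}

for word in 'Φύλλο' 'κισσός' 'φθινόπωρο'; do
    printf '%s: %s bytes\n' "$word" "$(byte_len "$word")"
done
```

Each Greek letter here takes two bytes in UTF-8, so the byte counts should be twice the character counts and should match the underline lengths in the C-locale output above.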

Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
sort: using simple byte comparison
φθινόπωρο
__________________

When all of the bytes are ignored as non-printable, then all three lines
are identical, hence -u prints only one line.
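
The collapse to a single line can be reproduced without --debug. This is a rough sketch, assuming GNU sort and a C locale in which isprint(3) rejects every byte above 0x7f; the variable names are made up for the example.

```shell
words='φθινόπωρο
κισσός
Φύλλο'

# Without -i, the three lines differ byte-wise, so -u keeps all of them.
plain_count=$(( $(printf '%s\n' "$words" | LC_ALL=C sort -u | wc -l) ))

# With -i, every byte is ignored as non-printable, the three keys are
# all empty and compare equal, so -u keeps only one line.
ignored_count=$(( $(printf '%s\n' "$words" | LC_ALL=C sort -i -u | wc -l) ))

echo "without -i: $plain_count line(s), with -i: $ignored_count line(s)"
```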

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes for φ were being ignored from the
particular sort field that I requested, because -i was in effect.  Same
thing goes for other sort options, such as 'sort -k1n' ignoring
characters after the end of the first parsed number.
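
The 'sort -k1n' case mentioned above can be seen with a small sketch (the input strings are invented for the illustration): only the leading number of each line takes part in the comparison, and the trailing letters are ignored.

```shell
# 9 sorts before 10 numerically; "abc" and "xyz" play no role in the
# comparison because -n stops parsing at the end of the number.
sorted=$(printf '10abc\n9xyz\n' | LC_ALL=C sort -k1n)
echo "$sorted"
```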

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8871; Package coreutils. (Wed, 15 Jun 2011 21:42:02 GMT)

Message #11 received at 8871 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>, 8871 <at> debbugs.gnu.org
Subject: Re: bug#8871: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 15:41:06 -0600
[re-adding the list]

On 06/15/2011 03:28 PM, Al Bogner wrote:
>> When all of the bytes are ignored as non-printable, then all three
>> lines are identical, hence -u prints only one line.
> 
> Ok and thanks. I had a different understanding of non-printable.

Non-printable means that isprint(3) returns 0 for a given byte (in a
single-byte locale such as C), or that iswprint(3) returns 0 for a given
wide character (a Unicode character decoded from UTF-8 bytes, in a
multi-byte locale such as de_DE.UTF-8).  These functions are
locale-specific: a byte value may be deemed printable in one locale but
not in another.  Furthermore, isprint(0xa0) and iswprint(0xa0) may give
different results within the same locale if the implementation rejects
incomplete UTF-8 sequences and treats only complete wchar_t values as
characters.  In that case, any code that calls isprint() on the
individual bytes of UTF-8, rather than iswprint() on the wchar_t of each
decoded Unicode character, gets the (unfortunate) result that no
multi-byte character is recognized as printable.
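
The per-byte behaviour can be observed from the shell. This is a minimal sketch, assuming a tr that classifies input one byte at a time (as GNU tr does), which makes its [:print:] class behave much like isprint(3) applied to each byte.

```shell
# Delete every byte that is NOT in the [:print:] class.  In the C locale
# both bytes of the UTF-8 encoding of φ are non-printable, so only the
# ASCII letters survive.
kept=$(printf 'aφc' | LC_ALL=C tr -cd '[:print:]')
echo "$kept"
```

This is the same set of bytes that 'sort -i' skips in the C locale.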

Factor into this mess the fact that upstream coreutils still lacks
decent multi-byte handling in a lot of utilities.  Various distros have
add-on patches for better wchar_t handling, but as of yet they have not
been consolidated into something that is easily maintainable and adds no
overhead to the single-byte C locale situation.

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8871; Package coreutils. (Thu, 16 Jun 2011 09:51:02 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Philipp Thomas <pth <at> suse.de>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#8871: Bug with "sort -i" ?
Date: Thu, 16 Jun 2011 11:50:25 +0200
* Eric Blake (eblake <at> redhat.com) [20110616 00:00]:

> been consolidated into something that is easily maintainable and adds no
> overhead to the single-byte C locale situation.

I at least doubt that there is a solution that adds no overhead.

Philipp




Added tag(s) notabug. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Thu, 16 Jun 2011 13:35:02 GMT)

bug closed, send any further explanations to 8871 <at> debbugs.gnu.org and Al Bogner <suse-linux <at> ml082.pinguin.uni.cc> Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Thu, 16 Jun 2011 13:35:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 15 Jul 2011 11:24:04 GMT)

This bug report was last modified 12 years and 260 days ago.
