GNU bug report logs - #23677
sort --debug not ignoring punctuation when sort does


Package: coreutils;

Reported by: Karl Berry <karl <at> freefriends.org>

Date: Wed, 1 Jun 2016 22:16:01 UTC

Severity: normal

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23677 in the body.
You can then email your comments to 23677 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#23677; Package coreutils. (Wed, 01 Jun 2016 22:16:01 GMT)

Acknowledgement sent to Karl Berry <karl <at> freefriends.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 01 Jun 2016 22:16:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Karl Berry <karl <at> freefriends.org>
To: bug-coreutils <at> gnu.org
Subject: sort --debug not ignoring punctuation when sort does
Date: Wed, 1 Jun 2016 22:14:48 GMT

Consider this two-line input file:
M !z
M /a
(! = ASCII 33; / = ASCII 47.)

Locale-dependent sort with debug:
LC_ALL=en_US.UTF-8 sort --debug -k2 /tmp/foo 

Output:
sort: using ‘en_US.UTF-8’ sorting rules
..
M /a
 ___
____
M !z
 ___
____

Due to the locale rules, the punctuation characters are being ignored
(presumably), or ! would sort before / (as it does with the LC_ALL=C
sort).  Therefore it seems the debug output would be closer to reality
if it was:

M /a
 _ _
____
M !z
 _ _
____

(I think; I'm not sure if all blanks are ignored in the locale
sort, or just multiple blanks collapsed to one.)

I realize that, in terms of mere string parsing, the punctuation is
included in the sort key.  But when a character is not actually used for
sorting, and the --debug output says it is, that seems suboptimal.
(Especially when the rules are, for all practical purposes,
undocumented.)

I also realize it is not necessarily feasible to change, even if there's
agreement on changing it.

@curmudgeon
How anyone can do anything useful with en_US.UTF-8 sort is beyond me ...
@end curmudgeon

Ok, no more from me in this area, you can be glad to know. --karl
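
To make the "mere string parsing" point concrete, here is a minimal Python sketch of how the `-k2` key span could be located when neither `-t` nor `-b` is given: the key starts right after the first field, so it includes the leading blank, and runs to the end of the line. The helper names `key_span_k2` and `debug_underline` are hypothetical, and this is not sort's actual code; it only mimics the first underline that --debug prints for the two input lines above.

```python
BLANKS = " \t"


def key_span_k2(line):
    """Return (start, end) of the key for -k2, assuming no -t and no -b.

    Assumption: fields are split at the transition from a non-blank to a
    blank character, and the blanks preceding a field belong to that field.
    """
    i = 0
    n = len(line)
    # Skip any blanks that lead field 1.
    while i < n and line[i] in BLANKS:
        i += 1
    # Skip the non-blank body of field 1.
    while i < n and line[i] not in BLANKS:
        i += 1
    # Field 2 starts here (blank included); with no end position the key
    # extends to the end of the line.
    return i, n


def debug_underline(line):
    """Build an underline covering the -k2 key, in the style of --debug.

    Lines with fewer than two fields are not treated the way sort treats
    them; this sketch just returns padding with no underscores.
    """
    start, end = key_span_k2(line)
    return " " * start + "_" * (end - start)


for line in ("M /a", "M !z"):
    print(line)
    print(debug_underline(line))
```

The underline covers the blank, the punctuation character and the letter, because the span is chosen before any collation rule is applied.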





Information forwarded to bug-coreutils <at> gnu.org:
bug#23677; Package coreutils. (Thu, 02 Jun 2016 16:58:01 GMT)

Message #8 received at 23677 <at> debbugs.gnu.org:

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Karl Berry <karl <at> freefriends.org>
Cc: 23677 <at> debbugs.gnu.org
Subject: Re: bug#23677: sort --debug not ignoring punctuation when sort does
Date: Thu, 02 Jun 2016 18:57:52 +0200

Karl Berry <karl <at> freefriends.org> writes:

> Due to the locale rules, the punctuation characters are being ignored
> (presumably),

They are not ignored, just considered only secondary, if the first order
characters didn't provide an ordering.

Andreas.
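
The "considered only secondary" behaviour can be illustrated with a toy two-level comparison in Python. This is a sketch under loud assumptions, not the algorithm glibc uses: the set of characters skipped at the first level (ASCII punctuation and blanks) and the helper name `toy_key` are invented for illustration, and real locales derive their weights from the locale definition.

```python
import string

# Assumption for this toy model: these characters carry no first-level weight.
IGNORED_AT_FIRST_LEVEL = set(string.punctuation) | {" ", "\t"}


def toy_key(s):
    """Return a (primary, secondary) sort key for s.

    primary   : s with the "ignored" characters removed
    secondary : s unchanged, consulted only when the primaries are equal
    """
    primary = "".join(ch for ch in s if ch not in IGNORED_AT_FIRST_LEVEL)
    return (primary, s)


# The letters differ, so punctuation never gets a say: "a" < "z".
print(sorted([" !z", " /a"], key=toy_key))

# The letters tie, so punctuation decides: "!" (33) < "/" (47).
print(sorted([" /a", " !a"], key=toy_key))
```

In this model the first call puts " /a" ahead of " !z", matching the order in the report, while the second call puts " !a" ahead of " /a" because only the tie-break level distinguishes them.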

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




Information forwarded to bug-coreutils <at> gnu.org:
bug#23677; Package coreutils. (Thu, 02 Jun 2016 21:29:02 GMT)

Message #11 received at 23677 <at> debbugs.gnu.org:

From: Karl Berry <karl <at> freefriends.org>
To: schwab <at> linux-m68k.org
Cc: 23677 <at> debbugs.gnu.org
Subject: Re: bug#23677: sort --debug not ignoring punctuation when sort does
Date: Thu, 2 Jun 2016 21:28:36 GMT

    They are not ignored, just considered only secondary, if the first
    order characters didn't provide an ordering.

Ok.  One would have no clue of that, either, from the --debug output.

sort obviously knows the exact rules defined by the locale, or it
couldn't do its job.  How about a way to dump the rules in some
human-readable way?  (In sort or another utility or a separate program
or whatever.)  Similar to how James Youngman found a way to write out
regex definitions in Texinfo ... just a wish ... -karl




Information forwarded to bug-coreutils <at> gnu.org:
bug#23677; Package coreutils. (Thu, 02 Jun 2016 22:10:02 GMT)

Message #14 received at 23677 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Karl Berry <karl <at> freefriends.org>, schwab <at> linux-m68k.org
Cc: 23677 <at> debbugs.gnu.org
Subject: Re: bug#23677: sort --debug not ignoring punctuation when sort does
Date: Thu, 2 Jun 2016 16:09:14 -0600

On 06/02/2016 03:28 PM, Karl Berry wrote:
>     They are not ignored, just considered only secondary, if the first
>     order characters didn't provide an ordering.
> 
> Ok.  One would have no clue of that, either, from the --debug output.
> 
> sort obviously knows the exact rules defined by the locale, or it
> couldn't do its job.

sort merely calls strcoll(); the rules are a black box to sort, and to
understand them you really have to know how strcoll() uses locale
definitions.

>  How about a way to dump the rules in some
> human-readable way?  (In sort or another utility or a separate program
> or whatever.)  Similar to how James Youngman found a way to write out
> regex definitions in Texinfo ... just a wish ... -karl

It might be nicer to request the glibc folks to give human-readable
descriptions of their locale files, and how strcoll() is affected by
those definitions, since it is more than just sort(1) that is impacted.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
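
Since the comparison is delegated to strcoll(), the same black box can be probed from any program. Below is a minimal Python sketch using the standard `locale` module; the helper name `collate_sign` is hypothetical. It compares the two keys from the report under a named locale and returns None when that locale is not installed.

```python
import locale


def collate_sign(a, b, locale_name):
    """Return -1, 0 or 1 for strcoll(a, b) under locale_name.

    Returns None if the locale cannot be selected on this system.
    The previous LC_COLLATE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            return None
        result = locale.strcoll(a, b)
        # Only the sign of strcoll's result is meaningful.
        return (result > 0) - (result < 0)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


for name in ("C", "en_US.UTF-8"):
    print(name, collate_sign(" !z", " /a", name))
```

In the C locale the comparison follows character codes, so " !z" sorts first. The en_US.UTF-8 result is deliberately not stated here: it depends on the installed locale data and the C library version, which is exactly why the rules look like a black box from sort's side.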


Information forwarded to bug-coreutils <at> gnu.org:
bug#23677; Package coreutils. (Sun, 28 Oct 2018 06:06:02 GMT)

Message #17 received at 23677 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: 23677 <at> debbugs.gnu.org
Subject: Re: bug#23677: sort --debug not ignoring punctuation when sort does
Date: Sun, 28 Oct 2018 00:05:37 -0600

close 23677
stop

(triaging old bugs)

On 2016-06-02 4:09 p.m., Eric Blake wrote:
> On 06/02/2016 03:28 PM, Karl Berry wrote:
>>      They are not ignored, just considered only secondary, if the first
>>      order characters didn't provide an ordering.
>>
>> Ok.  One would have no clue of that, either, from the --debug output.
>>
>> sort obviously knows the exact rules defined by the locale, or it
>> couldn't do its job.
> 
> sort merely calls strcoll(); the rules are a black box to sort, and to
> understand them you really have to know how strcoll() uses locale
> definitions.
> 
>>   How about a way to dump the rules in some
>> human-readable way?  (In sort or another utility or a separate program
>> or whatever.)  Similar to how James Youngman found a way to write out
>> regex definitions in Texinfo ... just a wish ... -karl
> 
> It might be nicer to request the glibc folks to give human-readable
> descriptions of their locale files, and how strcoll() is affected by
> those definitions, since it is more than just sort(1) that is impacted.
> 

With no further follow-ups in 2 years,
I'm closing this bug.
Discussion can continue by replying to this thread.

-assaf




bug closed, send any further explanations to 23677 <at> debbugs.gnu.org and Karl Berry <karl <at> freefriends.org> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 06:06:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Nov 2018 12:24:11 GMT)

This bug report was last modified 6 years and 59 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.