GNU bug report logs - #29044
sort --debug results improvement


Package: coreutils;

Reported by: Dan Jacobson <jidanni <at> jidanni.org>

Date: Sat, 28 Oct 2017 17:31:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 29044 in the body.
You can then email your comments to 29044 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#29044; Package coreutils. (Sat, 28 Oct 2017 17:31:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Dan Jacobson <jidanni <at> jidanni.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 28 Oct 2017 17:31:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dan Jacobson <jidanni <at> jidanni.org>
To: bug-coreutils <at> gnu.org
Subject: sort --debug results improvement
Date: Sun, 29 Oct 2017 01:26:13 +0800
$ sort -k 2n -k 3n --debug file.txt
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
sort: key 2 is numeric and spans multiple fields
41 011 92.3 亞太
   ___
       ____
________________
41 011 97.1 大漢
   ___
       ____

OK but they look like they only span one field.

Also the user is confused if
________________
is a "key 3", or just a separator.

Therefore please say
": key 1" or "1" etc. at the end of each of them.
This is also important if there are many keys.

And add a separator bar, made of -, =, etc. but not _.

Also the Info documentation doesn't mention how to influence
"sort: using simple byte comparison"
which seems to always be printed when using --debug no matter what.




Information forwarded to bug-coreutils <at> gnu.org:
bug#29044; Package coreutils. (Sun, 29 Oct 2017 03:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 29044 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Dan Jacobson <jidanni <at> jidanni.org>, 29044 <at> debbugs.gnu.org
Subject: Re: bug#29044: sort --debug results improvement
Date: Sat, 28 Oct 2017 21:06:01 -0600
tag 29044 notabug
close 29044
thanks

Hello,

There are a few issues at hand. Answering out of order:

> $ sort -k 2n -k 3n --debug file.txt
[...]
> Also the user is confused if
> ________________
> is a "key 3", or just a separator.
>
> Therefore please say
> ": key 1" or "1" etc. at the end of each of them.
> This is also important if there are many keys.
>
> And add a separator bar, made of -, =, etc. but not _.

This is indeed a 3rd key: it is the default behavior
of the 'last resort' sorting by the entire line.
It is not a separator.

It is used to sort lines for which the specified keys are equal.
It can be disabled with the "-s/--stable" option.

Consider the following:

Case 1: The first key is equal ("A" in both lines).
Sort then uses the last resort sorting and compares the entire
lines, making "A B" appear first:

  $ printf "%s\n" "A C" "A B" | sort --debug -k1,1
  A B
  _
  ___
  A C
  _
  ___


Case 2: Using "-s" disables the last-resort comparison, and lines with
equal keys are printed in the same order as in the input (hence "stable"):

  $ printf "%s\n" "A C" "A B" | sort --debug -k1,1 -s
  A C
  _
  A B
  _
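
The same effect can be checked on two-key input like the one in the original report. The following is only a sketch: the data lines are ASCII stand-ins for the reporter's file.txt, and `count_underlines` is a hypothetical helper that counts the underline rows printed on stdout. With two keys, one would expect three underline rows per input line without "-s" and two with it.

```shell
# ASCII stand-ins for the reporter's file.txt (assumed data, not the real file).
data=$(printf '%s\n' "41 011 92.3 aa" "41 011 97.1 bb")

# stdout carries each input line followed by one underline row per comparison;
# the diagnostic messages go to stderr and are discarded here.
with_last_resort=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2n -k 3n --debug 2>/dev/null)
stable_only=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k 2n -k 3n --debug 2>/dev/null)

# Hypothetical helper: count rows made up solely of spaces and underscores.
count_underlines() { printf '%s\n' "$1" | grep -c '^ *_\{1,\}$'; }

echo "underline rows without -s: $(count_underlines "$with_last_resort")"
echo "underline rows with -s:    $(count_underlines "$stable_only")"
```

The difference between the two counts is the last-resort, whole-line comparison, one row per input line.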




On 2017-10-28 11:26 AM, Dan Jacobson wrote:
> $ sort -k 2n -k 3n --debug file.txt
> sort: using simple byte comparison
> sort: key 1 is numeric and spans multiple fields
> sort: key 2 is numeric and spans multiple fields
> 41 011 92.3 亞太
>     ___
>         ____
> ________________
> 41 011 97.1 大漢
>     ___
>         ____
> 
> OK but they look like they only span one field.

'sort --debug' indicates the *actual* characters
that were used for the comparison.
In the case of "-n" (numeric sort), the conversion to a numeric value
stops at the space character, and the underline shows exactly that.

This has nothing to do with the fact that the key specification
spans multiple fields for a single numeric key.
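
For comparison, here is a sketch of how the warning itself relates to the key specification. It assumes the same kind of input as the report (the data lines are ASCII stand-ins) and captures only stderr. A key such as "-k 2n" runs from field 2 to the end of the line, while "-k 2,2n" ends at field 2, so only the first form should draw the "spans multiple fields" message.

```shell
# ASCII stand-ins for the reporter's file.txt (assumed data, not the real file).
data=$(printf '%s\n' "41 011 97.1 bb" "41 011 92.3 aa")

# Open-ended keys: "-k 2n" means "from field 2 to the end of the line".
# "2>&1 >/dev/null" keeps the diagnostics and drops the sorted output.
open_warnings=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2n -k 3n --debug 2>&1 >/dev/null)

# Closed keys: "-k 2,2n" means "field 2 only".
closed_warnings=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2n -k 3,3n --debug 2>&1 >/dev/null)

printf 'open-ended keys:\n%s\n\nsingle-field keys:\n%s\n' "$open_warnings" "$closed_warnings"
```

In both runs the underlines on stdout cover only the digits that were converted, which is why the marked region looks the same either way.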


Consider the following cases (I'm using "-s" for all cases to
reduce clutter, it doesn't change the meaning):

Case 1: Because we used alphanumeric sorting order (the default),
all the characters until the first space are marked by "--debug":

  $ printf "%s\n" "11A A" "33 C" "4e4D D" | sort -k1,1 --debug -s
  11A A
  ___
  33 C
  __
  4e4D D
  ____


Case 2: with numeric sorting, only the digits are marked:

  $ printf "%s\n" "11A A" "33 C" "4e4D D" | sort -k1n,1 --debug -s
  4e4D D
  _
  11A A
  __
  33 C
  __


Case 3: if using "-g" (general numeric sort, which can parse scientific
notation) the "4e4" is parsed, but parsing stops at the "D" character:

  $ printf "%s\n" "11A A" "33 C" "4e4D D" | sort -s -k1g,1 --debug
  11A A
  __
  33 C
  __
  4e4D D
  ___



> Also the Info documentation doesn't mention how to influence
> "sort: using simple byte comparison"
> which seems to always be printed when using --debug no matter what.

This message indicates you are sorting in the C/POSIX locale.
Perhaps it is the default locale on your system?

"sort --debug" will always print the sorting rules, e.g.:

  $ LC_ALL=en_CA.UTF-8 sort --debug < /dev/null
  sort: using ‘en_CA.UTF-8’ sorting rules

  $ LC_ALL=C sort --debug < /dev/null
  sort: using simple byte comparison
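
To find out which environment variable is responsible, one can follow the POSIX precedence for locale categories: LC_ALL first, then LC_COLLATE, then LANG. The function below is a minimal sketch of that lookup; `effective_collate` is a hypothetical helper, not part of coreutils, and the final fallback to "C" is an assumption about the system default.

```shell
# Hypothetical helper: report which variable decides the collation locale,
# following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
# The last branch assumes the system default is the C locale.
effective_collate() {
  if [ -n "${LC_ALL:-}" ]; then
    printf 'LC_ALL=%s\n' "$LC_ALL"
  elif [ -n "${LC_COLLATE:-}" ]; then
    printf 'LC_COLLATE=%s\n' "$LC_COLLATE"
  elif [ -n "${LANG:-}" ]; then
    printf 'LANG=%s\n' "$LANG"
  else
    printf 'default=C\n'
  fi
}

# Show the deciding variable for the current shell environment.
effective_collate
```

Running the standard `locale` command without arguments gives a similar picture, since it prints the effective value of each category, including LC_COLLATE.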





As such,
I'm marking this item as not-a-bug and closing it, but discussion can 
continue by replying to this thread.

regards,
 - assaf








Information forwarded to bug-coreutils <at> gnu.org:
bug#29044; Package coreutils. (Sun, 29 Oct 2017 18:36:02 GMT) Full text and rfc822 format available.

Message #11 received at 29044 <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29044 <at> debbugs.gnu.org
Subject: Re: bug#29044: sort --debug results improvement
Date: Mon, 30 Oct 2017 02:35:18 +0800
Your answer is absolutely pure gold for a new page linked from

‘--debug’
     Highlight the portion of each line used for sorting.  Also issue
     warnings about questionable usage to stderr.

in the Info manual! Please don't let it go to waste sitting in the bug
tracker. Perhaps call it Debugging examples. You can pretty much just
quote the entire exchange between you and me.

P.S., Yes indeed I had LC_COLLATE=C so maybe --debug should mention
where in the environment it made its choices from too.




Information forwarded to bug-coreutils <at> gnu.org:
bug#29044; Package coreutils. (Sun, 29 Oct 2017 18:41:01 GMT) Full text and rfc822 format available.

Message #14 received at 29044 <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29044 <at> debbugs.gnu.org
Subject: Re: bug#29044: sort --debug results improvement
Date: Mon, 30 Oct 2017 02:40:16 +0800
< P.S., Yes indeed I had LC_COLLATE=C so maybe --debug should mention
< where in the environment it made its choices from too.

Ah, like you said

  $ LC_ALL=en_CA.UTF-8 sort --debug < /dev/null
  sort: using ‘en_CA.UTF-8’ sorting rules

  $ LC_ALL=C sort --debug < /dev/null
  sort: using simple byte comparison

So the last line should be
  sort: using 'C' sorting rules (simple byte comparison)

or maybe also say "effective LC_COLLATE value is ...."..




Information forwarded to bug-coreutils <at> gnu.org:
bug#29044; Package coreutils. (Sun, 29 Oct 2017 21:35:01 GMT) Full text and rfc822 format available.

Message #17 received at 29044 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>,
 Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29044 <at> debbugs.gnu.org
Subject: Re: bug#29044: sort --debug results improvement
Date: Sun, 29 Oct 2017 14:34:14 -0700
On 29/10/17 11:40, 積丹尼 Dan Jacobson wrote:
> < P.S., Yes indeed I had LC_COLLATE=C so maybe --debug should mention
> < where in the environment it made its choices from too.
> 
> Ah, like you said
> 
>   $ LC_ALL=en_CA.UTF-8 sort --debug < /dev/null
>   sort: using ‘en_CA.UTF-8’ sorting rules
> 
>   $ LC_ALL=C sort --debug < /dev/null
>   sort: using simple byte comparison
> 
> So the last line should be
>   sort: using 'C' sorting rules (simple byte comparison)
> 
> or maybe also say "effective LC_COLLATE value is ...."..

"C" sorting is badly named and assumes prior knowledge,
and is also ambiguous with C.UTF8 etc.
I thought "simple byte comparison" was the most appropriate.

I agree we might mention the locale env vars,
though the defaults and the significant env vars vary per system.

cheers,
Pádraig.





Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 01:46:04 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 29044 <at> debbugs.gnu.org and Dan Jacobson <jidanni <at> jidanni.org> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 01:46:04 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Nov 2018 12:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 123 days ago.
