GNU bug report logs - #51011
[GNU sort] Numerical sort with delimiter may be broken (bug)


Package: coreutils;

Reported by: Juncheng Yang <peter.waynechina <at> gmail.com>

Date: Mon, 4 Oct 2021 15:04:01 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 51011 in the body.
You can then email your comments to 51011 AT debbugs.gnu.org in the normal way.




Report forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 15:04:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Juncheng Yang <peter.waynechina <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 04 Oct 2021 15:04:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Juncheng Yang <peter.waynechina <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: [GNU sort] Numerical sort with delimiter may be broken (bug) 
Date: Mon, 4 Oct 2021 10:36:52 -0400
[Message part 1 (text/plain, inline)]
Hi coreutils developers, 
    I have encountered a bug in GNU sort in which sort produces incorrect results when sorting numerically with a delimiter. For example, 
sort -nk1 -t , file 
cannot sort a file with the following lines (sort by the first column numerically) 
1,a
0,9

I have tried multiple versions, including the latest one, and the problem still exists. 


Best, 
Juncheng 



Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 15:19:03 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Davide Brini <dave_br <at> gmx.com>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be
 broken (bug)
Date: Mon, 4 Oct 2021 17:18:35 +0200
On Mon, 4 Oct 2021 10:36:52 -0400, Juncheng Yang
<peter.waynechina <at> gmail.com> wrote:

> Hi coreutils developers,
>     I have encountered a bug in GNU sort in which sort produces incorrect
> results when sorting numerically with a delimiter. For example, sort -nk1 -t ,
> file cannot sort a file with the following lines (sort by the first
> column numerically)
> 1,a
> 0,9
>
> I have tried multiple versions, including the latest one, and the problem
> still exists.

Works for me with

sort -t, -k1,1n

Keep in mind that with just "-k1" you're effectively telling sort to
consider fields from the first up to the last (ie the whole line), not just
the first one.
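
A minimal sketch of that difference, not taken from the original report. The helper names and the input lines (a,2 and a,1) are made up for illustration, and LC_ALL=C is set only to keep the result reproducible. A plain text sort with -s (stable) makes the extent of the key visible: with -k1 the second field takes part in the comparison, while with -k1,1 both keys are "a" and the input order is kept.

```shell
# Hypothetical helpers: same options except for the key definition.
sort_whole_line()  { LC_ALL=C sort -s -t , -k1; }    # key = field 1 to end of line
sort_first_field() { LC_ALL=C sort -s -t , -k1,1; }  # key = field 1 only

# The key spans "a,2" and "a,1", so the second field decides the order.
printf '%s\n' a,2 a,1 | sort_whole_line
# prints a,1 then a,2

# Both keys are "a"; -s keeps the lines in input order.
printf '%s\n' a,2 a,1 | sort_first_field
# prints a,2 then a,1
```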


--
D.




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 15:59:01 GMT) Full text and rfc822 format available.

Message #11 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Mon, 4 Oct 2021 16:58:38 +0100
tag 51011 notabug
close 51011
stop

On 04/10/2021 15:36, Juncheng Yang wrote:
> Hi coreutils developers,
>      I have encountered a bug in GNU sort in which sort produces incorrect results when sorting numerically with a delimiter. For example,
> sort -nk1 -t , file
> cannot sort a file with the following lines (sort by the first column numerically)
> 1,a
> 0,9
> 
> I have tried multiple versions, including the latest one, and the problem still exists.

The --debug option points out the issue:

  $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
  sort: key 1 is numeric and spans multiple fields
  1,a
  _
  ___
  0,9
  ___
  ___


So you want -k1,1n
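
As a sketch of how to check a key definition with --debug: the helper below (a hypothetical name) discards the sorted output and keeps only the diagnostics that GNU sort writes to standard error. It assumes GNU sort, since other implementations may not have --debug, and it sets LC_ALL=C so the messages are untranslated.

```shell
# Hypothetical helper: show only the diagnostics from sort --debug.
# "2>&1 >/dev/null" sends stderr to the pipe and drops the sorted lines.
sort_debug_warnings() {
  LC_ALL=C sort --debug "$@" 2>&1 >/dev/null
}

# The key runs to end of line, so the "spans multiple fields" warning appears.
printf '%s\n' 1,a 0,9 | sort_debug_warnings -t , -nk1

# The key is limited to field 1, so that warning is absent.
printf '%s\n' 1,a 0,9 | sort_debug_warnings -t , -k1,1n
```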

cheers,
Pádraig




Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Mon, 04 Oct 2021 15:59:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 51011 <at> debbugs.gnu.org and Juncheng Yang <peter.waynechina <at> gmail.com> Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Mon, 04 Oct 2021 15:59:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 16:30:03 GMT) Full text and rfc822 format available.

Message #18 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Juncheng Yang <peter.waynechina <at> gmail.com>
To: 51011 <at> debbugs.gnu.org
Subject: Problem solved - thoughts on confusing behavior 
Date: Mon, 4 Oct 2021 11:29:07 -0400
Hi developers, 
    It looks like I misunderstood how `-k` works; after changing to -k 1,1 it now works. 
    However, this is confusing because 1) the behaviors of `-n` and `-g` are not consistent, 2) the `-n` in GNU sort is different from the sort on MacOS (which has pos2 as pos1+1 instead of 0)… 


Best, 
Juncheng 



Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 20:03:02 GMT) Full text and rfc822 format available.

Message #21 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Mon, 4 Oct 2021 13:01:56 -0700
On 10/4/21 08:58, Pádraig Brady wrote:
> The --debug option points out the issue:
> 
>    $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>    sort: key 1 is numeric and spans multiple fields
>    1,a
>    _
>    ___
>    0,9
>    ___
>    ___

As Juncheng points out, it is a bit odd that -n and -g disagree here, 
even in locales where ',' is not a decimal point. For example:

$ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
0,9
_
___
1,a
_
___
$ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,a
_
___
0,9
___
___
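
For contrast, a small sketch in which the key is restricted to the first field. The key text is then just "1" or "0", so -n and -g have nothing to disagree about. The helper name is hypothetical, and LC_ALL=C is used only for reproducibility.

```shell
# Hypothetical helper: comma-separated input, key options passed by the caller.
sort_csv() { LC_ALL=C sort -t , "$@"; }

printf '%s\n' 1,a 0,9 | sort_csv -k1,1n
# prints 0,9 then 1,a

printf '%s\n' 1,a 0,9 | sort_csv -k1,1g
# prints 0,9 then 1,a
```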




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 20:03:02 GMT) Full text and rfc822 format available.

Message #24 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: Problem solved - thoughts on confusing behavior
Date: Mon, 4 Oct 2021 13:01:52 -0700
On 10/4/21 08:29, Juncheng Yang wrote:
> However, this is confusing because 1) the behavior of `-n` and `-g` are not consistent

Yes, that is confusing. I have followed up to Pádraig about this.

> 2) the `-n` in GNU sort is different from the sort on MacOS (which has
> pos2 as pos1+1 instead of 0)…

GNU sort does that too; that's old-fashioned syntax that is not 
recommended nowadays; it's better to use the -k option. macOS 'sort' 
supports -k too, surely.




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 22:52:02 GMT) Full text and rfc822 format available.

Message #27 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Juncheng Yang <peter.waynechina <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 51011 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Mon, 4 Oct 2021 18:51:50 -0400
Thank you, Paul and Padraig! 
May I ask why, when it fails to sort numerically, 1,a comes before 0,9? I could not come up with an ordering in which 1,a is smaller. 


Best, 
Jason 


> On Oct 4, 2021, at 4:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> 
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>>   $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>>   sort: key 1 is numeric and spans multiple fields
>>   1,a
>>   _
>>   ___
>>   0,9
>>   ___
>>   ___
> 
> As Juncheng points out, it is a bit odd that -n and -g disagree here, even in locales where ',' is not a decimal point. For example:
> 
> $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 0,9
> _
> ___
> 1,a
> _
> ___
> $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___





Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 04 Oct 2021 22:57:01 GMT) Full text and rfc822 format available.

Message #30 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Juncheng Yang <peter.waynechina <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: Problem solved - thoughts on confusing behavior
Date: Mon, 4 Oct 2021 18:56:16 -0400
Thank you, Paul! :) 
In my test, the -k option of the sort on Mac behaves differently from GNU sort (I made a mistake stating -n). In other words, 
printf '%s\n' 1,a 0,9 | sort -nk1 -t ,
works on Mac, and this is why I thought GNU sort has a bug at first. 
Thank you again for your quick response! The GNU tools (and maintainers/contributors) are really amazing! 


Best, 
Juncheng 


> On Oct 4, 2021, at 4:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> 
> On 10/4/21 08:29, Juncheng Yang wrote:
>> However, this is confusing because 1) the behavior of `-n` and `-g` are not consistent
> 
> Yes, that is confusing. I have followed up to Pádraig about this.
> 
>> 2) the `-n` in GNU sort is different from the sort on MacOS (which has pos2 as pos1+1 instead of 0)…
> 
> GNU sort does that too; that's old-fashioned syntax that is not recommended nowadays; it's better to use the -k option. macOS 'sort' supports -k too, surely.





Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Fri, 08 Oct 2021 13:38:02 GMT) Full text and rfc822 format available.

Message #33 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>,
 Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Fri, 8 Oct 2021 14:37:42 +0100
On 04/10/2021 21:01, Paul Eggert wrote:
> On 10/4/21 08:58, Pádraig Brady wrote:
>> The --debug option points out the issue:
>>
>>     $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
>>     sort: key 1 is numeric and spans multiple fields
>>     1,a
>>     _
>>     ___
>>     0,9
>>     ___
>>     ___
> 
> As Juncheng points out, it is a bit odd that -n and -g disagree here,
> even in locales where ',' is not a decimal point. For example:
> 
> $ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 0,9
> _
> ___
> 1,a
> _
> ___
> $ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
> sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
> sort: key 1 is numeric and spans multiple fields
> 1,a
> _
> ___
> 0,9
> ___
> ___

The difference here is due to ',' being treated as a thousands sep,
not a decimal point. So Juncheng, to answer your question specifically,
0,9 is being interpreted as 9, which sorts after 1,a. For example, consider:

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
0,900
_
1,a
_

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
1,a
_
0,900
_____


Given the various groupings possible (depending on locale
one can group in 2, 3, 4, 5 digits) we effectively just
ignore the grouping separator in numeric mode, hence the difference.
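
A rough model of that reading, for illustration only. This is not sort's actual parser: the function name is hypothetical, and it ignores signs, leading blanks, and decimal points. It assumes ',' is the locale's grouping character, drops every ',', and keeps the leading run of digits, which is the value -n would compare.

```shell
# Hypothetical model of how -n reads a key when ',' is a grouping character:
# remove the separators, then keep only the leading digits.
n_key_digits() {
  printf '%s\n' "$1" | tr -d ',' | sed 's/^\([0-9]*\).*$/\1/'
}

n_key_digits 1,a     # prints 1
n_key_digits 0,9     # prints 09   (numerically 9, so it sorts after 1)
n_key_digits 0,900   # prints 0900 (numerically 900)
```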

Note in locales where , is a decimal point we do get
consistent order between -g and -n as expected:

$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__
$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__

For completeness, we do have another issue with grouping separators,
where we don't support multi-byte separators appropriately.
For example, fr_FR.utf8 uses "narrow non breaking space" as the separator,
which we don't support:

$ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
$ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0 900
_
0800
____


cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Fri, 08 Oct 2021 20:49:02 GMT) Full text and rfc822 format available.

Message #36 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Fri, 8 Oct 2021 13:48:17 -0700
On 10/8/21 6:37 AM, Pádraig Brady wrote:
> 
> The difference here is due to ',' being treated as a thousands sep,
> not a decimal point.

Oh, thanks. Of course! I should have figured that out myself.

It is unfortunate that "," is treated as a thousands separator even 
though it's obviously not one (as it's not followed by 3 decimal 
digits). I don't think POSIX requires this behavior; it's not clear to 
me that POSIX even allows it.

This bug report suggests that we should alter the code so that 'sort -n' 
acts more like common practice, and requires thousands separators to be 
in the right places in order to treat nearby digits as part of the 
number. Alternatively, we could document the existing behavior (even if 
it's not clear that it conforms to POSIX).




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sat, 09 Oct 2021 02:33:01 GMT) Full text and rfc822 format available.

Message #39 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Sat, 9 Oct 2021 03:32:08 +0100
On 08/10/2021 21:48, Paul Eggert wrote:
> On 10/8/21 6:37 AM, Pádraig Brady wrote:
>>
>> The difference here is due to ',' being treated as a thousands sep,
>> not a decimal point.
> 
> Oh, thanks. Of course! I should have figured that out myself.
> 
> It is unfortunate that "," is treated as a thousands separator even
> though it's obviously not one (as it's not followed by 3 decimal
> digits). I don't think POSIX requires this behavior; it's not clear to
> me that POSIX even allows it.

Well in general it's not a thousands separator, rather a grouping character,
and groups can be in 2, 3, 4, and even 5.  So I don't think we should
change the logic here.

> This bug report suggests that we should alter the code so that 'sort -n'
> acts more like common practice, and requires thousands separators to be
> in the right places in order to treat nearby digits to be part of the
> number. Alternatively, we could document the existing behavior (even if
> it's not clear that it conforms to POSIX).

What we can do is have --debug warn when there is an overlap
in --field-separator and the grouping and decimal characters
when using numeric keys.  I'll have a look at that tomorrow.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sat, 09 Oct 2021 03:49:01 GMT) Full text and rfc822 format available.

Message #42 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Fri, 8 Oct 2021 20:48:18 -0700
On 10/8/21 7:32 PM, Pádraig Brady wrote:
> it's not a thousands separator, rather a grouping 
> character,
> and groups can be in 2, 3, 4, and even 5.

Sure, but 'sort' could determine the group sizes from the locale, and 
reject digit strings that are formatted improperly according to the 
group-size rules. (Not that I plan to write the code to do that....)




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sat, 09 Oct 2021 12:02:01 GMT) Full text and rfc822 format available.

Message #45 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Sat, 9 Oct 2021 13:00:53 +0100
On 09/10/2021 04:48, Paul Eggert wrote:
> On 10/8/21 7:32 PM, Pádraig Brady wrote:
>> it's not a thousands separator, rather a grouping
>> character,
>> and groups can be in 2, 3, 4, and even 5.
> 
> Sure, but 'sort' could determine the group sizes from the locale, and
> reject digit strings that are formatted improperly according to the
> group-size rules. (Not that I plan to write the code to do that....)

Yes I agree that would be better, but not worth it I think
as there would still be ambiguity in what was a grouping char
and what was a field separator. Also that ambiguity would
now vary across locales.

Another possible change which I'd prefer TBH
would be to disable the grouping separator, or decimal point
if they overlapped with --field-separator.
Doing this would induce a warning from --debug also.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sat, 09 Oct 2021 22:30:03 GMT) Full text and rfc822 format available.

Message #48 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [GNU sort] Numerical sort with delimiter may be broken
 (bug)
Date: Sat, 9 Oct 2021 15:29:02 -0700
On 10/9/21 5:00 AM, Pádraig Brady wrote:
> On 09/10/2021 04:48, Paul Eggert wrote:

>> 'sort' could determine the group sizes from the locale, and
>> reject digit strings that are formatted improperly according to the
>> group-size rules. (Not that I plan to write the code to do that....)
> 
> Yes I agree that would be better, but not worth it I think
> as there would still be ambiguity in what was a grouping char
> and what was a field separator. Also that ambiguity would
> now vary across locales.

I don't see the ambiguity problem. The field separator is used to 
identify fields; once the fields are identified, the thousands 
separator, decimal point, etc. contribute to numeric comparison in the 
usual way. So it's OK (albeit confusing) for the field separator to be 
'.' or ',' or '-' or '0' or any other character that could be part of 
a number.

For example, with 'sort -t 0 -k 2,2n', the digit 0 is not part of the 
numeric field that is compared, and there's no ambiguity about that even 
though 0 is allowed in numbers. The same idea applies to 'sort -t , -k 
2,2n'.
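
A short sketch of that point, with made-up input lines and hypothetical helper names; LC_ALL=C is used only for reproducibility. Once the separator has split the line, it is no longer part of the key, so even a digit works as a field separator.

```shell
# Hypothetical helpers: sort numerically on field 2 with an unusual separator.
sort_sep_zero()  { LC_ALL=C sort -t 0 -k 2,2n; }
sort_sep_comma() { LC_ALL=C sort -t , -k 2,2n; }

# '0' splits "a05" into "a" and "5"; only "5" and "3" are compared.
printf '%s\n' a05 a03 | sort_sep_zero
# prints a03 then a05

# ',' splits "x,10" into "x" and "10"; only "10" and "9" are compared.
printf '%s\n' x,10 x,9 | sort_sep_comma
# prints x,9 then x,10
```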




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 10 Oct 2021 17:59:02 GMT) Full text and rfc822 format available.

Message #51 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: [PATCH] sort: --debug: add warnings about radix and grouping chars
Date: Sun, 10 Oct 2021 18:57:57 +0100
[Message part 1 (text/plain, inline)]
On 09/10/2021 23:29, Paul Eggert wrote:
> On 10/9/21 5:00 AM, Pádraig Brady wrote:
>> On 09/10/2021 04:48, Paul Eggert wrote:
> 
>>> 'sort' could determine the group sizes from the locale, and
>>> reject digit strings that are formatted improperly according to the
>>> group-size rules. (Not that I plan to write the code to do that....)
>>
>> Yes I agree that would be better, but not worth it I think
>> as there would still be ambiguity in what was a grouping char
>> and what was a field separator. Also that ambiguity would
>> now vary across locales.
> 
> I don't see the ambiguity problem. The field separator is used to
> identify fields; once the fields are identified, the thousands
> separator, decimal point, etc. contribute to numeric comparison in the
> usual way. So it's OK (albeit confusing) for the field separator to be
> '.' or ',' or '-' or '0' or any other character that could be part of
> a number.
> 
> For example, with 'sort -t 0 -k 2,2n', the digit 0 is not part of the
> numeric field that is compared, and there's no ambiguity about that even
> though 0 is allowed in numbers. The same idea applies to 'sort -t , -k
> 2,2n'.

Indeed. I dropped -t, from my later examples and confused myself.

Attached is the proposed change to add appropriate warnings in this area.
Examples now diagnosed are:

  $ printf '0,9\n1,a\n' | sort -nk1 --debug -t, -s
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘,’ is treated as a group separator in numbers
  1,a
  _
  0,9
  ___

  $ printf '1,a\n0,9\n' | LC_ALL=fr_FR.utf8 sort -gk1 --debug -t, -s
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘,’ is treated as a decimal point in numbers
  0,9
  ___
  1,a
  __

  $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
  sort: numbers use ‘.’ as a decimal point in this locale
  0.9
  ___
  1.0
  ___

  $ printf '1.0\n0.9\n' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
  sort: numbers use ‘,’ as a decimal point in this locale
  0.9
  _
  1.0
  _

cheers,
Pádraig
[sort--debug-radix.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 10 Oct 2021 21:22:02 GMT) Full text and rfc822 format available.

Message #54 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Pádraig Brady <P <at> draigBrady.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [PATCH] sort: --debug: add warnings about radix and
 grouping chars
Date: Sun, 10 Oct 2021 23:20:50 +0200
On 10/10/21 19:57, Pádraig Brady wrote:
>    sort: numbers use ‘.’ as a decimal point in this locale

What about adding the hint to that message that this is an "ambiguity warning"?

    sort: ambiguity warning: numbers use ‘.’ as a decimal point in this locale

(Likewise for the other cases, of course.)

Most other --debug messages usually state how sort processes the options / the
input, while this one tells why the processing potentially does not work as the
user expected.

+1 otherwise.

Have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 10 Oct 2021 21:46:01 GMT) Full text and rfc822 format available.

Message #57 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>,
 Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [PATCH] sort: --debug: add warnings about radix and
 grouping chars
Date: Sun, 10 Oct 2021 14:45:32 -0700
On 10/10/21 2:20 PM, Bernhard Voelker wrote:
> What about adding the hint to that message that this is an "ambiguity warning"?

I don't think it's ambiguous (merely confusing :-).




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 10 Oct 2021 23:35:02 GMT) Full text and rfc822 format available.

Message #60 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: [PATCH] sort: --debug: add warnings about radix and grouping chars
Date: Sun, 10 Oct 2021 16:34:06 -0700
[Message part 1 (text/plain, inline)]
The warnings look good, except that this one:

>    $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
>    sort: numbers use ‘.’ as a decimal point in this locale
>    0.9
>    ___
>    1.0
>    ___

seems overkill if we're in the C locale.

Also, shouldn't similar diagnostics be generated if the field separator 
is '-', or '+', or a digit in the current locale?


> +  if (numeric_field_span)
> +    {
> +      char sep_string[2] = { 0, };
> +      sep_string[0] = thousands_sep;
> +      if ((tab == TAB_DEFAULT
> +           && (isblank (to_uchar (thousands_sep))))
> +          || tab == thousands_sep)
> +        {
> +          error (0, 0,
> +                 _("field separator %s is treated as a "
> +                   "group separator in numbers"),
> +                 quote (sep_string));
> +          number_locale_warned = true;
> +        }
> +    }

This code brought it to my attention that the GNU 'sort' has had a 
longstanding bug (in code that I wrote long ago - sorry!) in that 
thousands_sep is either -1 or an unsigned char converted to int, and 
this doesn't work in some unusual cases. I installed the attached patch 
to fix that bug, and I vaguely suspect that it fixes similar bugs in GNU 
'test' and GNU 'expr'. Good thing you brought it to my attention. 
(Sorry, I'm too lazy and/or time-pressed and/or overconfident to write 
test cases....)

Anyway, with that patch installed, TAB and THOUSANDS_SEP can both be 
CHAR_MAX + 1 so the above code needs to be twiddled. Also, we can assume 
C99. So, something like the following (pardon Thunderbird's line wrap):

  if (numeric_field_span
      && (tab == TAB_DEFAULT
          ? thousands_sep != NON_CHAR && isblank (to_uchar (thousands_sep))
          : tab == thousands_sep))
    {
      error (0, 0,
             _("field separator %s is treated as a group separator in numbers"),
             quote (((char []) {thousands_sep, 0})));
      number_locale_warned = true;
    }

with a similar replacement to the decimal_point code.
[0001-sort-fix-unlikely-bug-when-377-0.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 11 Oct 2021 01:48:02 GMT) Full text and rfc822 format available.

Message #63 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [PATCH] sort: --debug: add warnings about radix and
 grouping chars
Date: Mon, 11 Oct 2021 02:47:40 +0100
On 10/10/2021 22:20, Bernhard Voelker wrote:
> On 10/10/21 19:57, Pádraig Brady wrote:
>>     sort: numbers use ‘.’ as a decimal point in this locale
> 
> What about adding the hint to that message that this is an "ambiguity warning"?
> 
>      sort: ambiguity warning: numbers use ‘.’ as a decimal point in this locale
> 
> (Likewise for the other cases, of course.)
> 
> Most other --debug messages usually state how sort processes the options / the
> input, while this one tells why the processing potentially does not work as the
> user expected.
> 
> +1 otherwise.

Yes it may be useful to tag messages as "informational" or "problematic".
I've mentioned this also in a reply to Paul.

thanks for the review,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Mon, 11 Oct 2021 01:55:01 GMT) Full text and rfc822 format available.

Message #66 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: [PATCH] sort: --debug: add warnings about radix and grouping chars
Date: Mon, 11 Oct 2021 02:54:21 +0100
On 11/10/2021 00:34, Paul Eggert wrote:
> The warnings look good, except that this one:
> 
>>     $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
>>     sort: numbers use ‘.’ as a decimal point in this locale
>>     0.9
>>     ___
>>     1.0
>>     ___
> 
> seems overkill if we're in the C locale.
> 
> Also, shouldn't similar diagnostics be generated if the field separator
> is '-', or '+', or a digit in the current locale?

Yes this is more informational than a warning.
As Bernhard mentioned it may be useful to tag
--debug messages as informational or warnings.

In this case it would be info:
but would change to warn: if (tab == decimal_point).

The reason for the info message is that it may not
be obvious to users that numeric comparison
depends on locale just like text,
and we already provide the informational
text comparison message indicating the current locale.
We would only show this info: message if doing numeric sorting.

>> +  if (numeric_field_span)
>> +    {
>> +      char sep_string[2] = { 0, };
>> +      sep_string[0] = thousands_sep;
>> +      if ((tab == TAB_DEFAULT
>> +           && (isblank (to_uchar (thousands_sep))))
>> +          || tab == thousands_sep)
>> +        {
>> +          error (0, 0,
>> +                 _("field separator %s is treated as a "
>> +                   "group separator in numbers"),
>> +                 quote (sep_string));
>> +          number_locale_warned = true;
>> +        }
>> +    }
> 
> This code brought it to my attention that the GNU 'sort' has had a
> longstanding bug (in code that I wrote long ago - sorry!) in that
> thousands_sep is either -1 or an unsigned char converted to int, and
> this doesn't work in some unusual cases. I installed the attached patch
> to fix that bug, and I vaguely suspect that it fixes similar bugs in GNU
> 'test' and GNU 'expr'. Good thing you brought it to my attention.
> (Sorry, I'm too lazy and/or time-pressed and/or overconfident to write
> test cases....)

I'd noted this and was going to follow up on it.
Thanks for sorting it out!

> Anyway, with that patch installed, TAB and THOUSANDS_SEP can both be
> CHAR_MAX + 1 so the above code needs to be twiddled. Also, we can assume
> C99. So, something like the following (pardon Thunderbird's line wrap):
> 
>     if (numeric_field_span
>         && (tab == TAB_DEFAULT
>             ? thousands_sep != NON_CHAR && isblank (to_uchar (thousands_sep))
>             : tab == thousands_sep))
>       {
>         error (0, 0,
>                _("field separator %s is treated as a group separator in numbers"),
>                quote (((char []) {thousands_sep, 0})));
>         number_locale_warned = true;
>       }
> 
> with a similar replacement to the decimal_point code.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 31 Oct 2021 22:02:01 GMT) Full text and rfc822 format available.

Message #69 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [PATCH] sort: --debug: add warnings about radix and
 grouping chars
Date: Sun, 31 Oct 2021 22:01:12 +0000
[Message part 1 (text/plain, inline)]
On 11/10/2021 02:54, Pádraig Brady wrote:
> On 11/10/2021 00:34, Paul Eggert wrote:
>> The warnings look good, except that this one:
>>
>>>      $ printf '1.0\n0.9\n' | sort -s -k1,1g --debug
>>>      sort: numbers use ‘.’ as a decimal point in this locale
>>>      0.9
>>>      ___
>>>      1.0
>>>      ___
>>
>> seems overkill if we're in the C locale.
>>
>> Also, shouldn't similar diagnostics be generated if the field separator
>> is '-', or '+', or a digit in the current locale?
> 
> Yes this is more informational than a warning.
> As Bernhard mentioned it may be useful to tag
> --debug messages as informational or warnings.
> 
> In this case it would be info:
> but would change to warn: if (tab == decimal_point).
> 
> The reason for the info message is that it may not
> be obvious to users that numeric comparison
> depends on locale just like text,
> and we already provide the informational
> text comparison message indicating the current locale.
> We would only show this info: message if doing numeric sorting.

Addressing your '+' and '-' comment.
Yes they may also be used as field separators and
so are worth explicitly warning about.

Re warning about using digits in --field-separator,
that would be an extreme edge case, and anyway
the --debug key marking should make apparent
the extent of the numbers being compared.
The same argument can be made for other characters possible in numbers, like:
  1E+4, nan, Infinity, 0xabcde.fp-3, etc.

As a related issue, I also thought it appropriate to warn
when we're ignoring multi-byte grouping chars in the locale.

The new warnings in this update are:

  $ LC_ALL=fr_FR.utf8 sort -n --debug /dev/null
  sort: the multi-byte number group separator in this locale is not supported

  $ sort --debug -t- -k1n /dev/null
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘-’ is treated as a minus sign in numbers

  $ sort --debug -t+ -k1g /dev/null
  sort: key 1 is numeric and spans multiple fields
  sort: field separator ‘+’ is treated as a plus sign in numbers

I'll apply this later.

cheers,
Pádraig
[sort--debug-number-span.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#51011; Package coreutils. (Sun, 31 Oct 2021 22:13:02 GMT) Full text and rfc822 format available.

Message #72 received at 51011 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Juncheng Yang <peter.waynechina <at> gmail.com>, 51011 <at> debbugs.gnu.org
Subject: Re: bug#51011: [PATCH] sort: --debug: add warnings about radix and
 grouping chars
Date: Sun, 31 Oct 2021 15:12:42 -0700
Thank you for working on this. Your points are well taken. One tiny comment:

> +  if (basic_numeric_field)
> +    {
> +      if (thousands_sep_ignored)

This might be better combined as "if (basic_numeric_field && 
thousands_sep_ignored)", so that it's more similar to the previous "if".




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 29 Nov 2021 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 147 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.