GNU bug report logs - #49239
Unexpected results with sort -V

Package: coreutils

Reported by: Michael <michael.debertol <at> gmail.com>

Date: Sun, 27 Jun 2021 06:37:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 49239 in the body.
You can then email your comments to 49239 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Sun, 27 Jun 2021 06:37:01 GMT)

Acknowledgement sent to Michael <michael.debertol <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 27 Jun 2021 06:37:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Michael <michael.debertol <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Unexpected results with sort -V
Date: Sun, 27 Jun 2021 00:04:53 +0200
Hi,
I found some unexpected results with sort -V. I hope this is the correct
place to send a bug report to [1].
They are caused by a bug in filevercmp inside gnulib, specifically in the
function match_suffix.
I assume it should, as documented, match a file ending as defined by this
regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
However, I found two cases where this does not happen:
1) Two consecutive dots. It is not checked if the character after a dot is
a dot. This results in nothing being matched in a case like "a..a", even
though it should match ".a" according to the regex.
Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
2) A trailing dot. If there is no additional character after a dot, it is
still matched (e.g. for "a." the . is matched).
Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.
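To make the two cases concrete, here is a minimal sketch that applies the documented regex with Python's re module and reports which suffix the regex itself selects. It only illustrates what the regex says, not what the gnulib code does; the helper name documented_suffix is made up for this example.

```python
import re

# The suffix pattern quoted in the report, anchored at the end of the name.
SUFFIX_RE = re.compile(r"(\.[A-Za-z~][A-Za-z0-9~]*)*$")


def documented_suffix(name: str) -> str:
    """Return the suffix selected by the documented regex (possibly empty)."""
    # search() finds the leftmost match. The pattern can match the empty
    # string at the end of the name, so a match always exists.
    return SUFFIX_RE.search(name).group(0)


# Case 1: two consecutive dots. The regex selects ".a".
case_double_dot = documented_suffix("a..a")

# Case 2: trailing dot. The regex selects nothing, because a dot must be
# followed by a letter or "~" to belong to the suffix.
case_trailing_dot = documented_suffix("a.")
```

Under this reading, "a..a" has the suffix ".a" and "a." has no suffix at all, which is the behavior the report expects from match_suffix.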

Additionally I noticed that filevercmp ignores all characters after a NUL
byte.
This can be seen here: printf "a\0a\na" | sort -Vs
sort seems to otherwise consider NUL bytes (that's why the --stable flag
is necessary in the above example). Is this the expected behavior?
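One way to picture this, assuming the comparison treats each line as a NUL-terminated C string: everything from the first NUL byte onward is invisible to the comparator. The sketch below only simulates that truncation in Python; c_string_view is a hypothetical helper and not part of gnulib or sort.

```python
def c_string_view(line: bytes) -> bytes:
    """Return the part of a line that a NUL-terminated string function sees."""
    # Keep only the bytes before the first NUL byte, if there is one.
    return line.split(b"\0", 1)[0]


# The two input lines from the test case above.
lines = [b"a\0a", b"a"]

# What a C-string-based comparison would effectively compare.
seen = [c_string_view(line) for line in lines]

# Both lines collapse to the same value, so such a comparison calls them equal.
all_equal = len(set(seen)) == 1
```

With both lines reduced to "a", the comparator reports a tie, and only then does --stable decide the output order.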

Finally I wanted to ask if it is the expected behavior for filevercmp to do
a strcmp if it can't find another difference, at least from the perspective
of sort.
This means that the --stable flag for sort has no effect in combination
with --version-sort (well, except if the input contains NUL bytes, as
mentioned above :)
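The interaction between a byte-wise fallback and a stable sort can be sketched with a toy comparator. The comparator below is an assumption made for illustration (digit runs compare as integers, everything else as text) and is much simpler than filevercmp; toy_version_cmp and with_fallback are hypothetical names.

```python
import re
from functools import cmp_to_key


def toy_version_cmp(a: str, b: str) -> int:
    """Toy comparator: digit runs compare as integers, other runs as text."""
    def key(s: str):
        # Each findall() result is (digits, text) with exactly one part set.
        return [(1, int(d), "") if d else (0, 0, t)
                for d, t in re.findall(r"([0-9]+)|([^0-9]+)", s)]
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


def with_fallback(a: str, b: str) -> int:
    """Same comparator, but ties are broken by plain string comparison."""
    return toy_version_cmp(a, b) or ((a > b) - (a < b))


# All three names tie under the toy comparator (leading zeros are ignored).
lines = ["a1", "a01", "a001"]

# Python's sorted() is stable, so ties keep their input order ...
kept_input_order = sorted(lines, key=cmp_to_key(toy_version_cmp))

# ... unless the comparator itself never reports a tie for distinct strings.
reordered = sorted(lines, key=cmp_to_key(with_fallback))
```

With the fallback in place, distinct lines never compare equal, so a stable sort has no ties left to preserve.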

I'll attach a rather simple patch to fix 1) and 2) (including test), I hope
that's right.

Have a nice day,
Michael

[1]:
https://www.gnu.org/software/coreutils/manual/html_node/Reporting-bugs-or-incorrect-results.html#Reporting-bugs-or-incorrect-results
[diff.txt (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Mon, 28 Jun 2021 16:42:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Kamil Dudka <kdudka <at> redhat.com>
To: Michael <michael.debertol <at> gmail.com>
Cc: 49239 <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
Subject: Re: bug#49239: Unexpected results with sort -V
Date: Mon, 28 Jun 2021 18:41:17 +0200
On Sunday, June 27, 2021 12:04:53 AM CEST Michael wrote:
> Hi,
> I found some unexpected results with sort -V. I hope this is the correct
> place to send a bug report to [1].
> They are caused by a bug in filevercmp inside gnulib, specifically in the
> function match_suffix.
> I assume it should, as documented, match a file ending as defined by this
> regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
> However, I found two cases where this does not happen:
> 1) Two consecutive dots. It is not checked if the character after a dot is
> a dot. This results in nothing being matched in a case like "a..a", even
> though it should match ".a" according to the regex.
> Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
> 2) A trailing dot. If there is no additional character after a dot, it is
> still matched (e.g. for "a." the . is matched).
> Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.

As far as I understand, regex (\.[A-Za-z~][A-Za-z0-9~]*)*$ specifies that each 
dot has to be followed by [A-Za-z~] to be matched.  Am I missing anything?

I am not saying that the current behavior is perfect (a solution that works as 
expected in all scenarios is difficult to find in this case) but, at least, it 
seems to me that it works as it is described.

Kamil

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Mon, 28 Jun 2021 16:53:01 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Michael Debertol <michael.debertol <at> gmail.com>
To: Kamil Dudka <kdudka <at> redhat.com>
Cc: 49239 <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
Subject: Re: bug#49239: Unexpected results with sort -V
Date: Mon, 28 Jun 2021 18:52:14 +0200
Am 28.06.21 um 18:41 schrieb Kamil Dudka:
> On Sunday, June 27, 2021 12:04:53 AM CEST Michael wrote:
>> Hi,
>> I found some unexpected results with sort -V. I hope this is the correct
>> place to send a bug report to [1].
>> They are caused by a bug in filevercmp inside gnulib, specifically in the
>> function match_suffix.
>> I assume it should, as documented, match a file ending as defined by this
>> regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
>> However, I found two cases where this does not happen:
>> 1) Two consecutive dots. It is not checked if the character after a dot is
>> a dot. This results in nothing being matched in a case like "a..a", even
>> though it should match ".a" according to the regex.
>> Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
>> 2) A trailing dot. If there is no additional character after a dot, it is
>> still matched (e.g. for "a." the . is matched).
>> Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.
> As far as I understand, regex (\.[A-Za-z~][A-Za-z0-9~]*)*$ specifies that each
> dot has to be followed by [A-Za-z~] to be matched.  Am I missing anything?
>
> I am not saying that the current behavior is perfect (a solution that works as
> expected in all scenarios is difficult to find in this case) but, at least, it
> seems to me that it works as it is described.

I was trying to say that the regex is not followed in two cases:

- when there are two dots followed by [A-Za-z~], the second dot should 
be matched, but it is not.

An example is "foo..a": In this case ".a" should be matched, but it is 
not (nothing is matched)

- when there's a trailing dot, the trailing dot is matched even though 
it is not followed by anything

e.g. "foo." matches the "." as the file ending, but it should not match 
a file ending in this case.

I hope it's clearer now,

Michael

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Mon, 28 Jun 2021 16:53:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Mon, 28 Jun 2021 17:55:01 GMT)

Message #20 received at submit <at> debbugs.gnu.org:

From: Kamil Dudka <kdudka <at> redhat.com>
To: Michael Debertol <michael.debertol <at> gmail.com>
Cc: 49239 <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
Subject: Re: bug#49239: Unexpected results with sort -V
Date: Mon, 28 Jun 2021 19:54:25 +0200
On Monday, June 28, 2021 6:52:14 PM CEST Michael Debertol wrote:
> I was trying to say that the regex is not followed in two cases:
> 
> - when there are two dots followed by [A-Za-z~], the second dot should
> be matched, but it is not.
> 
> An example is "foo..a": In this case ".a" should be matched, but it is
> not (nothing is matched)
> 
> - when there's a trailing dot, the trailing dot is matched even though
> it is not followed by anything
> 
> e.g. "foo." matches the "." as the file ending, but it should not match
> a file ending in this case.
> 
> I hope it's clearer now,
> 
> Michael

You are right.  The matching algorithm was not implemented correctly and
the patch you attached fixes it.  Sorry for missing it in my previous reply.

Kamil

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Mon, 28 Jun 2021 17:55:02 GMT)

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 13 Feb 2022 05:32:02 GMT)

Notification sent to Michael <michael.debertol <at> gmail.com>:
bug acknowledged by developer. (Sun, 13 Feb 2022 05:32:02 GMT)

Message #28 received at 49239-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Kamil Dudka <kdudka <at> redhat.com>,
 Michael Debertol <michael.debertol <at> gmail.com>
Cc: 49239-done <at> debbugs.gnu.org, Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#49239: Unexpected results with sort -V
Date: Sat, 12 Feb 2022 21:31:33 -0800
On 6/28/21 10:54, Kamil Dudka wrote:
> You are right.  The matching algorithm was not implemented correctly and
> the patch you attached fixes it.

I looked into Bug#49239 and found some more places where the 
documentation disagreed with the code. I installed the attached patches 
into Gnulib and Coreutils, respectively, which should bring the two into 
agreement and should fix the bugs that Michael reported albeit in a 
different way than his proposed patch. Briefly:

* The code didn't allow file name suffixes to be the entire file name, 
but the documentation did. Here I went with the documentation. I could 
be talked into the other way; it shouldn't matter much either way.

* The code did the preliminary test (without suffixes) using strcmp, 
whereas the documentation said it should use version comparison. Here I 
went with the documentation.

* As Michael mentioned, sort -V mishandled NUL. I fixed this by adding a 
Gnulib function filenvercmp that treats NUL as just another character.

* As Michael also mentioned, filevercmp fell back on strcmp if version 
sort found no difference, which meant sort's --stable flag was 
ineffective. I fixed this by not having filevercmp fall back on strcmp.

* I fixed the two-consecutive dot and trailing-dot bugs Michael 
mentioned, by rewriting the suffix finder to not have that confusing 
READ_ALPHA state variable, and to instead implement the regular 
expression's nested * operators in the usual way with nested loops.

Thanks, Michael, for reporting the problem. I'm boldly closing the 
Coreutils bug report as fixed.
[0001-filevercmp-fix-several-unexpected-results.patch (text/x-patch, attachment)]
[0001-sort-fix-several-version-sort-problems.patch (text/x-patch, attachment)]
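
The nested-loop idea from the last bullet can be sketched as follows. This is not the installed Gnulib patch, only a minimal Python illustration of implementing the regex's nested * operators with nested loops instead of a state variable; suffix_start, FIRST and REST are names invented for this sketch.

```python
import string

# Characters allowed right after a dot, and in the rest of a suffix part.
FIRST = set(string.ascii_letters + "~")
REST = FIRST | set(string.digits)


def suffix_start(name: str) -> int:
    """Index where the suffix matching (\\.[A-Za-z~][A-Za-z0-9~]*)*$ starts.

    Returns len(name) when the name has no suffix.
    """
    n = len(name)
    i = 0
    while i < n:
        start = i
        # Outer "*": zero or more parts of the form ".X...".
        while i + 1 < n and name[i] == "." and name[i + 1] in FIRST:
            i += 2
            # Inner "*": zero or more further suffix characters.
            while i < n and name[i] in REST:
                i += 1
        if i == n:
            # The chain of parts reached the end of the name.
            return start
        # name[i] cannot begin a part, so no suffix starts at or before i.
        i += 1
    return n


def split_suffix(name: str) -> tuple[str, str]:
    """Split a name into (stem, suffix) using suffix_start()."""
    k = suffix_start(name)
    return name[:k], name[k:]
```

For the two reported cases, split_suffix("a..a") gives ("a.", ".a") and split_suffix("a.") gives ("a.", ""). A name such as ".a" is treated as being entirely suffix, in line with the documented regex.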

Information forwarded to bug-coreutils <at> gnu.org:
bug#49239; Package coreutils. (Sun, 13 Feb 2022 14:18:01 GMT)

Message #31 received at 49239 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Kamil Dudka <kdudka <at> redhat.com>,
 Michael Debertol <michael.debertol <at> gmail.com>
Cc: 49239 <at> debbugs.gnu.org
Subject: Re: bug#49239: Unexpected results with sort -V
Date: Sun, 13 Feb 2022 14:17:16 +0000
On 13/02/2022 05:31, Paul Eggert wrote:
> On 6/28/21 10:54, Kamil Dudka wrote:
>> You are right.  The matching algorithm was not implemented correctly and
>> the patch you attached fixes it.
> 
> I looked into Bug#49239 and found some more places where the
> documentation disagreed with the code. I installed the attached patches
> into Gnulib and Coreutils, respectively, which should bring the two into
> agreement and should fix the bugs that Michael reported albeit in a
> different way than his proposed patch. Briefly:
> 
> * The code didn't allow file name suffixes to be the entire file name,
> but the documentation did. Here I went with the documentation. I could
> be talked into the other way; it shouldn't matter much either way.
> 
> * The code did the preliminary test (without suffixes) using strcmp, the
> documentation said it should use version comparison. Here I went with
> the documentation.
> 
> * As Michael mentioned, sort -V mishandled NUL. I fixed this by adding a
> Gnulib function filenvercmp that treats NUL as just another character.
> 
> * As Michael also mentioned, filevercmp fell back on strcmp if version
> sort found no difference, which meant sort's --stable flag was
> ineffective. I fixed this by not having filevercmp fall back on strcmp.
> 
> * I fixed the two-consecutive dot and trailing-dot bugs Michael
> mentioned, by rewriting the suffix finder to not have that confusing
> READ_ALPHA state variable, and to instead implement the regular
> expression's nested * operators in the usual way with nested loops.
> 
> Thanks, Michael, for reporting the problem. I'm boldly closing the
> Coreutils bug report as fixed.

A very thorough analysis.
All looks good.

thank you!
Pádraig

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 14 Mar 2022 11:24:06 GMT)

This bug report was last modified 2 years and 15 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.