GNU bug report logs - #49239: Unexpected results with sort -V
Report forwarded to bug-coreutils <at> gnu.org: bug#49239; Package coreutils. (Sun, 27 Jun 2021 06:37:01 GMT)
Acknowledgement sent to Michael <michael.debertol <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 27 Jun 2021 06:37:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
I found some unexpected results with sort -V. I hope this is the correct
place to send a bug report to [1].
They are caused by a bug in filevercmp inside gnulib, specifically in the
function match_suffix.
I assume it should, as documented, match a file ending as defined by this
regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
However, I found two cases where this does not happen:
1) Two consecutive dots. It is not checked if the character after a dot is
a dot. This results in nothing being matched in a case like "a..a", even
though it should match ".a" according to the regex.
Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
2) A trailing dot. If there is no additional character after a dot, it is
still matched (e.g. for "a." the . is matched).
Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.
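For reference, the documented regex can be checked directly against the names discussed here. This is a minimal Python sketch, not the gnulib code: documented_suffix is a hypothetical helper name, and \Z is used in place of $ so that a trailing newline is never skipped.

```python
import re

# The suffix regex quoted in the report, anchored with \Z (end of string).
SUFFIX_RE = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")


def documented_suffix(name):
    """Return the file ending matched by the documented regex (may be empty)."""
    # search() reports the leftmost position from which the pattern can
    # reach the end of the string, i.e. the longest matching suffix.
    return SUFFIX_RE.search(name).group(0)


for name in ["a..a", "a.", "a.+", "foo.tar.gz"]:
    print(repr(name), "->", repr(documented_suffix(name)))
```

Read this way, "a..a" has the ending ".a" and "a." has no ending at all, which is what cases 1) and 2) expect.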
Additionally I noticed that filevercmp ignores all characters after a NUL
byte.
This can be seen here: printf "a\0a\na" | sort -Vs
sort otherwise seems to take NUL bytes into account (that's why the --stable
flag is necessary in the above example). Is this the expected behavior?
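The NUL observation is consistent with how C strings work: a function that receives only a char pointer sees the bytes up to the first NUL. A minimal Python sketch of that view follows; c_string_view is a hypothetical helper that only mimics NUL termination and is not part of gnulib.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string function would see."""
    return data.split(b"\0", 1)[0]


line_a = b"a\0a"
line_b = b"a"

# Seen as C strings, the two lines look identical ...
print(c_string_view(line_a) == c_string_view(line_b))
# ... although the raw bytes differ.
print(line_a == line_b)
```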
Finally, I wanted to ask whether it is expected for filevercmp to fall back to
strcmp if it can't find another difference, at least from the perspective
of sort.
This means that the --stable flag for sort has no effect in combination
with --version-sort (well, except if the input contains NUL bytes, as
mentioned above :)
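To illustrate why a byte-wise fallback leaves --stable nothing to decide, here is a rough Python sketch. toy_version_key is a made-up stand-in that only compares digit runs numerically; it is not filevercmp, and the tie between "a1" and "a01" is a property of this toy key.

```python
import re


def toy_version_key(s):
    """Toy key: digit runs compare as integers, other runs as text."""
    key = []
    for digits, text in re.findall(r"([0-9]+)|([^0-9]+)", s):
        key.append((1, int(digits), "") if digits else (0, 0, text))
    return key


lines = ["a1", "a01"]

# Ties are kept in input order (Python's sort is stable).
stable = sorted(lines, key=toy_version_key)

# Ties are broken by comparing the raw strings, like a strcmp fallback.
with_fallback = sorted(lines, key=lambda s: (toy_version_key(s), s))

print(stable)
print(with_fallback)
```

With the fallback, distinct lines never compare equal, so the result no longer depends on the input order and a stable sort has no ties left to preserve.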
I'll attach a rather simple patch to fix 1) and 2) (including test), I hope
that's right.
Have a nice day,
Michael
[1]:
https://www.gnu.org/software/coreutils/manual/html_node/Reporting-bugs-or-incorrect-results.html#Reporting-bugs-or-incorrect-results
[diff.txt (text/plain, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#49239; Package coreutils. (Mon, 28 Jun 2021 16:42:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org:
On Sunday, June 27, 2021 12:04:53 AM CEST Michael wrote:
> Hi,
> I found some unexpected results with sort -V. I hope this is the correct
> place to send a bug report to [1].
> They are caused by a bug in filevercmp inside gnulib, specifically in the
> function match_suffix.
> I assume it should, as documented, match a file ending as defined by this
> regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
> However, I found two cases where this does not happen:
> 1) Two consecutive dots. It is not checked if the character after a dot is
> a dot. This results in nothing being matched in a case like "a..a", even
> though it should match ".a" according to the regex.
> Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
> 2) A trailing dot. If there is no additional character after a dot, it is
> still matched (e.g. for "a." the . is matched).
> Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.
As far as I understand, regex (\.[A-Za-z~][A-Za-z0-9~]*)*$ specifies that each
dot has to be followed by [A-Za-z~] to be matched. Am I missing anything?
I am not saying that the current behavior is perfect (a solution that works as
expected in all scenarios is difficult to find in this case) but, at least, it
seems to me that it works as it is described.
Kamil
Information forwarded to bug-coreutils <at> gnu.org: bug#49239; Package coreutils. (Mon, 28 Jun 2021 16:53:01 GMT)
Message #14 received at submit <at> debbugs.gnu.org:
On 28.06.21 at 18:41, Kamil Dudka wrote:
> On Sunday, June 27, 2021 12:04:53 AM CEST Michael wrote:
>> Hi,
>> I found some unexpected results with sort -V. I hope this is the correct
>> place to send a bug report to [1].
>> They are caused by a bug in filevercmp inside gnulib, specifically in the
>> function match_suffix.
>> I assume it should, as documented, match a file ending as defined by this
>> regex: /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
>> However, I found two cases where this does not happen:
>> 1) Two consecutive dots. It is not checked if the character after a dot is
>> a dot. This results in nothing being matched in a case like "a..a", even
>> though it should match ".a" according to the regex.
>> Testcase: printf "a..a\na.+" | sort -V # a..a should be before a.+ I think
>> 2) A trailing dot. If there is no additional character after a dot, it is
>> still matched (e.g. for "a." the . is matched).
>> Testcase: printf "a.\na+" | sort -V # I think a+ should be before a.
> As far as I understand, regex (\.[A-Za-z~][A-Za-z0-9~]*)*$ specifies that each
> dot has to be followed by [A-Za-z~] to be matched. Am I missing anything?
>
> I am not saying that the current behavior is perfect (a solution that works as
> expected in all scenarios is difficult to find in this case) but, at least, it
> seems to me that it works as it is described.
I was trying to say that the regex is not followed in two cases:
- when there are two dots followed by [A-Za-z~], the second dot should
be matched, but it is not.
An example is "foo..a": In this case ".a" should be matched, but it is
not (nothing is matched)
- when there's a trailing dot, the trailing dot is matched even though
it is not followed by anything
e.g. "foo." matches the "." as the file ending, but it should not match
a file ending in this case.
I hope it's clearer now,
Michael
Information forwarded to bug-coreutils <at> gnu.org: bug#49239; Package coreutils. (Mon, 28 Jun 2021 17:55:01 GMT)
Message #20 received at submit <at> debbugs.gnu.org:
On Monday, June 28, 2021 6:52:14 PM CEST Michael Debertol wrote:
> I was trying to say that the regex is not followed in two cases:
>
> - when there are two dots followed by [A-Za-z~], the second dot should
> be matched, but it is not.
>
> An example is "foo..a": In this case ".a" should be matched, but it is
> not (nothing is matched)
>
> - when there's a trailing dot, the trailing dot is matched even though
> it is not followed by anything
>
> e.g. "foo." matches the "." as the file ending, but it should not match
> a file ending in this case.
>
> I hope it's clearer now,
>
> Michael
You are right. The matching algorithm was not implemented correctly and
the patch you attached fixes it. Sorry for missing it in my previous reply.
Kamil
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Sun, 13 Feb 2022 05:32:02 GMT)
Notification sent to Michael <michael.debertol <at> gmail.com>: bug acknowledged by developer. (Sun, 13 Feb 2022 05:32:02 GMT)
Message #28 received at 49239-done <at> debbugs.gnu.org:
On 6/28/21 10:54, Kamil Dudka wrote:
> You are right. The matching algorithm was not implemented correctly and
> the patch you attached fixes it.
I looked into Bug#49239 and found some more places where the
documentation disagreed with the code. I installed the attached patches
into Gnulib and Coreutils, respectively, which should bring the two into
agreement and should fix the bugs that Michael reported albeit in a
different way than his proposed patch. Briefly:
* The code didn't allow file name suffixes to be the entire file name,
but the documentation did. Here I went with the documentation. I could
be talked into the other way; it shouldn't matter much either way.
* The code did the preliminary test (without suffixes) using strcmp, whereas
the documentation said it should use version comparison. Here I went with
the documentation.
* As Michael mentioned, sort -V mishandled NUL. I fixed this by adding a
Gnulib function filenvercmp that treats NUL as just another character.
* As Michael also mentioned, filevercmp fell back on strcmp if version
sort found no difference, which meant sort's --stable flag was
ineffective. I fixed this by not having filevercmp fall back on strcmp.
* I fixed the two-consecutive dot and trailing-dot bugs Michael
mentioned, by rewriting the suffix finder to not have that confusing
READ_ALPHA state variable, and to instead implement the regular
expression's nested * operators in the usual way with nested loops.
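The nested-loop idea in the last item can be sketched as follows. This is a Python illustration only, not the code from the attached patches; suffix_start, START and REST are hypothetical names.

```python
import string

START = set(string.ascii_letters + "~")                  # [A-Za-z~]
REST = set(string.ascii_letters + string.digits + "~")   # [A-Za-z0-9~]


def suffix_start(name):
    """Return the index where the longest matching file ending begins."""
    n = len(name)
    i = 0
    while i < n:
        j = i
        # Outer "*": zero or more groups, each a dot plus a valid first char.
        while j + 1 < n and name[j] == "." and name[j + 1] in START:
            j += 2
            # Inner "*": the remaining characters of this group.
            while j < n and name[j] in REST:
                j += 1
        if j == n:
            return i      # the groups starting at i reach the end
        i = j + 1         # no ending can start at or before position j
    return n              # empty ending


for name in ["a..a", "a.", "foo.tar.gz"]:
    print(repr(name), "->", repr(name[suffix_start(name):]))
```

Because a dot is not in either character class, each group is matched greedily without backtracking, so two plain nested loops are enough and no extra state variable is needed.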
Thanks, Michael, for reporting the problem. I'm boldly closing the
Coreutils bug report as fixed.
[0001-filevercmp-fix-several-unexpected-results.patch (text/x-patch, attachment)]
[0001-sort-fix-several-version-sort-problems.patch (text/x-patch, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#49239; Package coreutils. (Sun, 13 Feb 2022 14:18:01 GMT)
Message #31 received at 49239 <at> debbugs.gnu.org:
On 13/02/2022 05:31, Paul Eggert wrote:
> On 6/28/21 10:54, Kamil Dudka wrote:
>> You are right. The matching algorithm was not implemented correctly and
>> the patch you attached fixes it.
>
> I looked into Bug#49239 and found some more places where the
> documentation disagreed with the code. I installed the attached patches
> into Gnulib and Coreutils, respectively, which should bring the two into
> agreement and should fix the bugs that Michael reported albeit in a
> different way than his proposed patch. Briefly:
>
> * The code didn't allow file name suffixes to be the entire file name,
> but the documentation did. Here I went with the documentation. I could
> be talked into the other way; it shouldn't matter much either way.
>
> * The code did the preliminary test (without suffixes) using strcmp, whereas
> the documentation said it should use version comparison. Here I went with
> the documentation.
>
> * As Michael mentioned, sort -V mishandled NUL. I fixed this by adding a
> Gnulib function filenvercmp that treats NUL as just another character.
>
> * As Michael also mentioned, filevercmp fell back on strcmp if version
> sort found no difference, which meant sort's --stable flag was
> ineffective. I fixed this by not having filevercmp fall back on strcmp.
>
> * I fixed the two-consecutive dot and trailing-dot bugs Michael
> mentioned, by rewriting the suffix finder to not have that confusing
> READ_ALPHA state variable, and to instead implement the regular
> expression's nested * operators in the usual way with nested loops.
>
> Thanks, Michael, for reporting the problem. I'm boldly closing the
> Coreutils bug report as fixed.
A very thorough analysis.
All looks good.
thank you!
Pádraig
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 14 Mar 2022 11:24:06 GMT)