GNU bug report logs - #33371
RFC: option for numeric sort: ignore-non-numeric characters


Package: coreutils;

Reported by: L A Walsh <coreutils <at> tlinx.org>

Date: Wed, 14 Nov 2018 02:34:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 33371 in the body.
You can then email your comments to 33371 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Wed, 14 Nov 2018 02:34:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to L A Walsh <coreutils <at> tlinx.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 14 Nov 2018 02:34:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: L A Walsh <coreutils <at> tlinx.org>
To: Coreutils <bug-coreutils <at> gnu.org>
Subject: RFC: option for numeric sort: ignore-non-numeric characters
Date: Tue, 13 Nov 2018 18:32:55 -0800
I have a bunch of files numbered from 1 to over 2000 without leading zeros
(think rfc's)...
They have names with a non-numeric prefix & suffix around the number.

It would be nice if sort had the option to ignore non-numeric
data and only sort on the numeric data in the 'lines'/'files'.

Yeah, I can renumber and rename them all, but I just wanted
an instant command that could sort numeric values even if embedded
in a line, where the "field" was determined by the start/stop of
numeric characters.

Or is there an option for this already, and is my manpage out of date?

Thx
-l





Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Wed, 14 Nov 2018 02:45:01 GMT) Full text and rfc822 format available.

Message #8 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: L A Walsh <coreutils <at> tlinx.org>, 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Tue, 13 Nov 2018 20:44:32 -0600
On 11/13/18 8:32 PM, L A Walsh wrote:
> I have a bunch of files numbered from 1-over 2000 without leading zeros
> (think rfc's)...
> They have names with a non-numeric prefix & suffix around the number.
> 
> It would be nice if sort had the option to ignore non-numeric
> data and only sort on the numeric data in the 'lines'/'files'.
> 
> Yeah, I can renumber and rename them all, but I just wanted
> an instant command that could sort numeric values even if embedded
> in a line, where the "field" was determined by the start/stop of
> numeric characters.
> 
> Or is there an options for this already, and my manpage out of date?

Without ACTUAL data to experiment with, it's much harder for anyone else 
to propose a solution that will work with your specific data.

But one quick approach comes to mind: decorate-sort-undecorate:

sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' < myinput \
  | sort -k1,1n | sed 's/^[0-9]* //' > myoutput
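
A minimal sketch of the same pipeline run against sample data. The file
names below are hypothetical RFC-style names, not the reporter's actual
data, and the function name sort_by_embedded_number is only for
illustration:

```shell
# Decorate-sort-undecorate, wrapped in a function for reuse.
# 1. sed copies the first run of digits to the front of the line as a key.
# 2. sort orders the lines numerically on that key only.
# 3. sed strips the key again.
sort_by_embedded_number() {
  sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' \
    | LC_ALL=C sort -k1,1n \
    | sed 's/^[0-9]* //'
}

# Hypothetical sample input, deliberately out of order.
printf '%s\n' rfc979.txt rfc98.txt rfc2000.txt rfc5.txt \
  | sort_by_embedded_number
# rfc5.txt
# rfc98.txt
# rfc979.txt
# rfc2000.txt
```

Only the first run of digits on each line is used as the key, so lines
with several numbers are ordered by the first one.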

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Wed, 14 Nov 2018 08:28:02 GMT) Full text and rfc822 format available.

Message #11 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: Erik Auerswald <auerswal <at> unix-ag.uni-kl.de>
To: L A Walsh <coreutils <at> tlinx.org>
Cc: 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Wed, 14 Nov 2018 09:27:20 +0100
Hi,

On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
> I have a bunch of files numbered from 1-over 2000 without leading zeros
> (think rfc's)...
> They have names with a non-numeric prefix & suffix around the number.

Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.

> It would be nice if sort had the option to ignore non-numeric
> data and only sort on the numeric data in the 'lines'/'files'.

Perhaps --version-sort could work for you?

$ for r in rfc{1..100}.txt; do echo "$r"; done | sort | sort -V

(The first sort un-sorts the sorted input data, the second sorts it
again.)
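
A small sketch of the difference, assuming GNU sort and a handful of
hypothetical file names (the helper function names is only for
illustration):

```shell
# Hypothetical sample names, deliberately out of order.
names() { printf '%s\n' rfc979.txt rfc98.txt rfc2000.txt rfc5.txt; }

# Plain bytewise sort compares character by character.
names | LC_ALL=C sort
# rfc2000.txt
# rfc5.txt
# rfc979.txt
# rfc98.txt

# Version sort compares the embedded digit runs as numbers.
names | sort -V
# rfc5.txt
# rfc98.txt
# rfc979.txt
# rfc2000.txt
```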

> [...]
> Or is there an options for this already, and my manpage out of date?

AFAIK not exactly.

Thanks,
Erik
-- 
It's impossible to learn very much by simply sitting in a lecture,
or even by simply doing problems that are assigned.
                        -- Richard P. Feynman




Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Thu, 15 Nov 2018 06:26:02 GMT) Full text and rfc822 format available.

Message #14 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: L A Walsh <coreutils <at> tlinx.org>
To: Eric Blake <eblake <at> redhat.com>, 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Wed, 14 Nov 2018 22:24:58 -0800

On 11/13/2018 6:44 PM, Eric Blake wrote:
> On 11/13/18 8:32 PM, L A Walsh wrote:
>> I have a bunch of files numbered from 1-over 2000 without leading zeros
>> (think rfc's)...
>> They have names with a non-numeric prefix & suffix around the number.
>>
>> It would be nice if sort had the option to ignore non-numeric
>> data and only sort on the numeric data in the 'lines'/'files'.
>>
>> Yeah, I can renumber and rename them all, but I just wanted
>> an instant command that could sort numeric values even if embedded
>> in a line, where the "field" was determined by the start/stop of
>> numeric characters.
>>
>> Or is there an options for this already, and my manpage out of date?
> 
> Without ACTUAL data to experiment with, it's much harder for anyone else 
> to propose a solution that will work with your specific data.
----
	...think rfcs...um have you ever looked at the directory 
with a bunch (all or most) rfc in it?


> 
> But one quick approach comes to mind: decorate-sort-undecorate:
> 
> sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' < myinput \
>    | sort -k1,1n | sed 's/^[0-9]* //' > myoutput

----
	That does work, but still seems a bit odd on a numeric
sort not to have it, even by default, ignore non-numeric data in front or after.

	I may be imagining this, but I thought I'd seen some version of sort
that did this -- simply skipping the non numeric characters and sorting on the
numbers.

	Instead this sort reverted to alpha sort.  Thinking about
it...if I ask for numeric sort, shouldn't it at least try to look for
numbers in each line to sort them?

	That seems like it might be a user-friendly and even consistent
thing to do, considering there are options to
1) ignore leading blanks
2) ignore case
3) ignore nonprinting... (this most closely parallels the request, since
	when doing an alpha sort, one might hope it could ignore what isn't
	visible).
4) "human sort" --- actually this option sorta makes it look like a
bug, since this sort ignores things that don't look like a number+suffix.
So why wouldn't numeric sort do the same?

I'd even sorta hoped the -h sort might work for this... since
if you were showing sizes, and only had values in 'bytes', you wouldn't see
the suffixes.  So I'd hoped that it would order 
rfc98.txt before rfc979.txt, but such is not the case.

I.e. in the case of 'ls', it ignores junk before and after the
numbers+optional unit.  So one might wonder why it doesn't properly
sort the numbers with 'rfc' before them and '.txt' after them.

I.e. should 4 have worked maybe?  Might be a bit perverse, but 
can't see why not.
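
For reference, a sketch of what -n does with such names, assuming GNU
sort. With -n the number is read from the start of the key, so a key
that begins with 'rfc' counts as zero and the last-resort bytewise
comparison decides the order. When the prefix has a fixed length, the
key can instead be started at a character offset. The sample names and
the helper functions names and sort_after_prefix are hypothetical:

```shell
# Hypothetical sample names, deliberately out of order.
names() { printf '%s\n' rfc979.txt rfc98.txt rfc2000.txt rfc5.txt; }

# Every key starts with "rfc", so each one evaluates to 0 and the
# lines end up in bytewise order.
names | LC_ALL=C sort -n
# rfc2000.txt
# rfc5.txt
# rfc979.txt
# rfc98.txt

# Start the numeric key at character 4 of field 1, skipping the fixed
# three-character prefix "rfc".
sort_after_prefix() { LC_ALL=C sort -k1.4n; }
names | sort_after_prefix
# rfc5.txt
# rfc98.txt
# rfc979.txt
# rfc2000.txt
```

The offset only helps when every line has a prefix of the same length.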



	
> 




Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Mon, 19 Nov 2018 01:09:01 GMT) Full text and rfc822 format available.

Message #17 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: L A Walsh <coreutils <at> tlinx.org>
To: Erik Auerswald <auerswal <at> unix-ag.uni-kl.de>
Cc: 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Sun, 18 Nov 2018 17:08:00 -0800

On 11/14/2018 12:27 AM, Erik Auerswald wrote:
> Hi,
> 
> On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
>> I have a bunch of files numbered from 1-over 2000 without leading zeros
>> (think rfc's)...
>> They have names with a non-numeric prefix & suffix around the number.
> 
> Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.
> 
>> It would be nice if sort had the option to ignore non-numeric
>> data and only sort on the numeric data in the 'lines'/'files'.
> 
> Perhaps --version-sort could work for you?
> 
> $ for r in rfc{1..100}.txt; do echo "$r"; done | sort | sort -V
> 
> (The first sort un-sorts the sorted input data, the seconds sorts it
> again.)
-----
	Tried this... had initial "turn-off" with using a for loop to
list files when '/bin/ls -1 *.txt' was so much shorter.  However, just
the 'sort -V' by itself works.

I'm not sure exactly why, but that wasn't initially clear to
me, though, maybe should have been, having written version-sort
more than once before. 

Minor gotchas, using single numbers, the for loop produced:
rfc1.txt
rfc2.txt
rfc3.txt
rfc4.txt
rfc5.txt
rfc6.txt
rfc7.txt
rfc8.txt
rfc9.txt

while the '/bin/ls -1 rfc?.txt|sort -V' algorithm produced:
rfc1.txt
rfc2.txt
rfc3.txt
rfc4.txt
rfc5.txt
rfc6.txt
----
(7-9 didn't exist in the directory)

>> [...]
>> Or is there an options for this already, and my manpage out of date?
> 
> AFAIK not exactly.
> 
> Thanks,
> Erik
----
	"-V" seems like it might be sufficient, but I doubt most
non-computer types would know that -V would sort multiple numeric fields
separated by invariant non-numeric characters in a numeric fashion
(or would even know how a version sort differs from the other sorts).

Given how well read docs are these days, almost need a literal definition
of 'version sort' besides just calling it a 'version sort' (which we
must admit, is 'jargon'). Along the lines of:

   --version-sort |  -V 
      Sees inputs as a mix of numeric and alphabetic (or "identifier")
      fields, where the numeric fields are sorted naturally, and alpha
      fields sorted alphabetically.  Fields may have separators like
      '.', '_', or '-',  sometimes constrained by a specific computer
      language, or may have no separator at all between numeric and
      alpha fields.  This type of sort is often called a 
      "version sort" in the computer field.

???  I listed 'version sort' at the end, as the equivalence so those who tend
to skip and read initial parts of lines/paragraphs would not just see 
"version sort" and gloss over the rest, inserting their own equivalence
for the definition -- especially likely w/"version-sort" being the long form
of the switch.










Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Mon, 19 Nov 2018 14:28:02 GMT) Full text and rfc822 format available.

Message #20 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: Erik Auerswald <auerswal <at> unix-ag.uni-kl.de>
To: L A Walsh <coreutils <at> tlinx.org>
Cc: 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Mon, 19 Nov 2018 15:27:00 +0100
Hi,

On 11/19/18 02:08, L A Walsh wrote:
> On 11/14/2018 12:27 AM, Erik Auerswald wrote:
>> On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
>>> I have a bunch of files numbered from 1-over 2000 without leading zeros
>>> (think rfc's)...
>>> They have names with a non-numeric prefix & suffix around the number.
>>
>> Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.
>>
>>> It would be nice if sort had the option to ignore non-numeric
>>> data and only sort on the numeric data in the 'lines'/'files'.
>>
>> Perhaps --version-sort could work for you?
>> [...]
> the 'sort -V' works by itself, works.
> [...]
>>> Or is there an options for this already, and my manpage out of date?
>>
>> AFAIK not exactly.
>> [...]
>      "-V" seems like it might be sufficient, but I doubt most
> non-computer types would know that -V would sort multiple numeric fields
> separated by invariant non-numeric characters in a numeric fashion
> (or would even know how a version sort is the other sorts).

As far as I remember, the definition of --version-sort is to follow the 
Debian GNU/Linux package version sorting rules. Those are based on 
numbers surrounded by text, but several characters have special meaning 
(e.g. '~' sorts before everything else, even before the empty string). 
Thus this is _not_ a "natural sort," but quite specific and potentially 
surprising.

$ printf -- 'foo\nbar\nfoo-bar\nfoo~bar\n' | sort --version-sort
bar
foo~bar
foo
foo-bar

> Given how well read docs are these days, almost need a literal definition
> of 'version sort' besides just calling it a 'version sort' (which we
> must admit, is 'jargon').

I think it is worse than jargon, because it is specific to one kind of 
version numbering scheme.
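
A short sketch of another way in which -V is not the same as "ignore the
non-numeric characters", assuming GNU sort: the text around the numbers
still takes part in the comparison. The input lines and the helper
numeric_only (a decorate-sort-undecorate pipeline keyed on the first run
of digits) are hypothetical:

```shell
# Version sort: the alphabetic prefix is compared first, then the number.
printf '%s\n' b2 a10 a9 | sort -V
# a9
# a10
# b2

# Sorting on the embedded number alone ignores the prefix entirely.
numeric_only() {
  sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' \
    | LC_ALL=C sort -k1,1n \
    | sed 's/^[0-9]* //'
}
printf '%s\n' b2 a10 a9 | numeric_only
# b2
# a9
# a10
```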

> Along the lines of:
> 
>    --version-sort |  -V
>       Sees inputs as a mix of numeric and alphabetic (or "identifier")
>       fields, where the numeric fields are sorted naturally, and alpha
>       fields sorted alphabetically.  Fields may have separators like
>       '.', '_', or '-',  sometimes constrained by a specific computer
>       language, or may have no separator at all between numeric and
>       alpha fields.  This is type of sort is often called a
>       "version sort" in the computer field.

Thus I am not sure about your suggestion above. :-/

> ???  I listed 'version sort' at the end, as the equivalence so those who tend
> to skip and read initial parts of lines/paragraphs would not just see
> "version sort" and gloss over the rest, inserting their own equivalence
> for the definition -- especially likely w/"version-sort" being the long form
> of the switch.

I like that strategy. :-)

Thanks,
Erik




Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Thu, 13 Dec 2018 20:51:02 GMT) Full text and rfc822 format available.

Message #23 received at 33371 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: L A Walsh <coreutils <at> tlinx.org>,
 Erik Auerswald <auerswal <at> unix-ag.uni-kl.de>
Cc: 33371 <at> debbugs.gnu.org
Subject: Re: bug#33371: RFC: option for numeric sort: ignore-non-numeric
 characters
Date: Thu, 13 Dec 2018 13:50:26 -0700
tags 33371 notabug
close 33371
stop

Hello,

On 2018-11-18 6:08 p.m., L A Walsh wrote:
> 
> On 11/14/2018 12:27 AM, Erik Auerswald wrote:
>>
>> Perhaps --version-sort could work for you?
>>
> ----
>      "-V" seems like it might be sufficient, 
Given the above, I'm closing this item.

regards,
 - assaf




Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 13 Dec 2018 20:51:03 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 33371 <at> debbugs.gnu.org and L A Walsh <coreutils <at> tlinx.org> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 13 Dec 2018 20:51:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#33371; Package coreutils. (Wed, 19 Dec 2018 21:46:01 GMT) Full text and rfc822 format available.

Message #30 received at 33371-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Xu Chunyang <mail <at> xuchunyang.me>
Cc: 33371-done <at> debbugs.gnu.org
Subject: 26.1; cl-make-random-state copying not working
Date: Wed, 19 Dec 2018 13:45:15 -0800
[Message part 1 (text/plain, inline)]
Thanks for reporting this bug; it is a regression introduced when we 
separated records from vectors. I installed the attached patch into the 
emacs-26 branch.

[0001-cl-make-random-state-was-not-copying-its-arg.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 17 Jan 2019 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 94 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.