GNU bug report logs - #61884
add an option to du that allows to control which file types are counted


Package: coreutils;

Reported by: Christoph Anton Mitterer <calestyo <at> scientia.org>

Date: Wed, 1 Mar 2023 03:20:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 61884 in the body.
You can then email your comments to 61884 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Wed, 01 Mar 2023 03:20:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Christoph Anton Mitterer <calestyo <at> scientia.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 01 Mar 2023 03:20:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: bug-coreutils <at> gnu.org
Subject: add an option to du that allows to control which file types are
 counted
Date: Wed, 01 Mar 2023 04:18:34 +0100
Hey.

When I want to count the nominal sizes of the (usually regular) files
in a directory I do something like:

du --apparent-size --block-size=1

This however also counts in the sizes of the directories themselves
(and I guess also of symlinks, etc.).


The "problem" with that is in particular, that for the exact same
dir/file structure, the results differ e.g. between ext4 and btrfs,
because of different sizes for the directories (themselves).

It would be nice if there were an option that allowed selecting which
file types are counted.


Yes I know that one can do something like:
find . -type f -print0  |  du --apparent-size -l -c -s --block-size=1 --files0-from=- | tail -n

But that's rather cumbersome... also I cannot do something like
du path1 path2 path3
and get totals for each and a grand summary.

And even if I make a shell alias out of this, I cannot do bash completion on it.


Thanks,
Chris.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Thu, 02 Mar 2023 16:02:01 GMT) Full text and rfc822 format available.

Message #8 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Christoph Anton Mitterer <calestyo <at> scientia.org>, 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Thu, 2 Mar 2023 16:01:05 +0000
On 01/03/2023 03:18, Christoph Anton Mitterer wrote:
> Hey.
> 
> When I want to count the nominal sizes of the (usually regular) files
> in a directory I do something like:
> 
> du --apparent-size --block-size=1
> 
> This however also counts in the sizes of the directories themselves
> (and I guess also of symlinks, etc.).
> 
> 
> The "problem" with that is in particular, that for the exact same
> dir/file structure, the results differ e.g. between ext4 and btrfs,
> because of different sizes for the directories (themselves).
> 
> It would be nice if there were an option that allowed selecting which
> file types are counted.
> 
> 
> Yes I know that one can do something like:
> find . -type f -print0  |  du --apparent-size -l -c -s --block-size=1 --files0-from=- | tail -n
> 
> But that's rather cumbersome... also I cannot do something like
> du path1 path2 path3
> and get totals for each and a grand summary.
> 
> And even if I make a shell alias out of this, I cannot do bash completion on it.

There are many possible filtering options,
which are probably best left to `find` (as per your example).
This was also mentioned previously at:
https://lists.gnu.org/archive/html/coreutils/2013-04/msg00043.html

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Thu, 02 Mar 2023 16:56:02 GMT) Full text and rfc822 format available.

Message #11 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: Pádraig Brady <P <at> draigBrady.com>, 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which
 file types are counted
Date: Thu, 02 Mar 2023 17:54:09 +0100
On Thu, 2023-03-02 at 16:01 +0000, Pádraig Brady wrote:
> There are many possible filtering options,
> which are probably best left to `find` (as per your example).
> This was also mentioned previously at:
> https://lists.gnu.org/archive/html/coreutils/2013-04/msg00043.html


Sure, but the problem with all these is that one doesn't get usable
per-operand totals - only one big overall total.

If you take e.g.:


find dir1 dir2 fileA -not -type d -print0 | du -hsc --files0-from=-

(without the tail), one gets one line per (non-directory) file below
dir1 and dir2 as well as one for fileA .. plus the grand overall total,
whereas it would be nice to have totals for:
- dir1
- dir2
- fileA
- overall

Cheers,
Chris.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Thu, 02 Mar 2023 17:21:02 GMT) Full text and rfc822 format available.

Message #14 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Golden <gdg <at> zplane.com>
To: Christoph Anton Mitterer <calestyo <at> scientia.org>
Cc: 61884 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigbrady.com>
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Thu, 2 Mar 2023 10:20:35 -0700
Hi Christoph,

Christoph Anton Mitterer <calestyo <at> scientia.org> [2023-03-02 17:54:09 +0100]:
> On Thu, 2023-03-02 at 16:01 +0000, Pádraig Brady wrote:
> > There are many possible filtering options,
> > which are probably best left to `find` (as per your example).
> > This was also mentioned previously at:
> > https://lists.gnu.org/archive/html/coreutils/2013-04/msg00043.html
> 
> 
> Sure, but the problem with all these is that one doesn't get usable
> per-operand totals - only one big overall total.
> 
> If you take e.g.:
> 
> 
> find dir1 dir2 fileA -not -type d -print0 | du -hsc --files0-from=-
> 
> (without the tail), one gets one line per (non-directory) file below
> dir1 and dir2 as well as one for fileA .. plus the grand overall total,
> whereas it would be nice to have totals for:
> - dir1
> - dir2
> - fileA
> - overall
> 

Would something like this work for you?

    ----------------------------------------------------------------
    $ echo dir1_file1 > dir1/file1
    $ echo dir1_file2 > dir1/file2
    $ echo dir2_file1 > dir2/file1
    $ echo dir2_file2 > dir2/file2
    $ echo somefile > fileA

    $ find dir1 dir2 fileA -not -type d -print0 | xargs --null du -hsc
    4.0K    dir1/file2
    4.0K    dir1/file1
    4.0K    dir2/file2
    4.0K    dir2/file1
    4.0K    fileA
    20K     total
    ----------------------------------------------------------------

- Glenn




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Fri, 03 Mar 2023 00:22:01 GMT) Full text and rfc822 format available.

Message #17 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: gdg <at> zplane.com
Cc: 61884 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigbrady.com>
Subject: Re: bug#61884: add an option to du that allows to control which
 file types are counted
Date: Fri, 03 Mar 2023 01:21:37 +0100
Hey Glenn

On Thu, 2023-03-02 at 10:20 -0700, Glenn Golden wrote:
> Would something like this work for you?
> 
>     ----------------------------------------------------------------
>     $ echo dir1_file1 > dir1/file1
>     $ echo dir1_file2 > dir1/file2
>     $ echo dir2_file1 > dir2/file1
>     $ echo dir2_file2 > dir2/file2
>     $ echo somefile > fileA
> 
>     $ find dir1 dir2 fileA -not -type d -print0 | xargs --null du -hsc
>     4.0K    dir1/file2
>     4.0K    dir1/file1
>     4.0K    dir2/file2
>     4.0K    dir2/file1
>     4.0K    fileA
>     20K     total
>     ----------------------------------------------------------------

TBH, I don't even understand how this should solve the "problem" I've
described above.

Your find would still return any non-directory files beneath dir1 and
dir2.
Because of xargs, du would see each of them as an argument (and likely
produce undesired results if there are too many files), and
subsequently still print each of them as a -s "total".


But apart from that,... it's clear that one can get the desired results
*somehow*, e.g. I simply use a script like this right now:


total_size=0
for pathname in "$@"; do
        size="$(  find "${pathname}" \! -type d -print0  |  du --apparent-size -l -c --block-size=1 --files0-from=-  |  tail -n 1  |  cut -d '  ' -f 1 )"
        total_size="$((  ${size} + ${total_size}  ))"
        
        printf '%s\t%s\n' "${size}" "${pathname}"
done
printf '%s\ttotal\n' "${total_size}"

# (with the -d ' ' being a literal tabulator - $'…' quoting is not (yet) POSIX standardised)



That of course gets ugly if one really has a lot of arguments (many
forked processes).
And it's not something that one can expect to be there per default.
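
For illustration, the per-argument forking could be avoided with a single pass. The sketch below is only one possible approach and rests on assumptions: it needs GNU find (for -printf with %s, the size in bytes, and %H, the starting point under which a file was found), it assumes file names without tabs or newlines, and per_operand_sizes is a hypothetical name. It sums st_size directly rather than calling du, so every hard link is counted, as with du -l.

```shell
# Hypothetical helper: per-operand totals of non-directory files in one pass.
# Assumes GNU find and operands that do not start with "-".
per_operand_sizes() {
    # One line per non-directory file: "<size>\t<starting point>".
    find "$@" ! -type d -printf '%s\t%H\n' |
    awk -F '\t' '
        {
            # Remember the order in which starting points first appear.
            if (!($2 in size)) order[++n] = $2
            size[$2] += $1
            total += $1
        }
        END {
            for (i = 1; i <= n; i++) printf "%.0f\t%s\n", size[order[i]], order[i]
            printf "%.0f\ttotal\n", total
        }'
}
```

Called as per_operand_sizes dir1 dir2 fileA, this prints one line per operand followed by a total line. An operand that contains no non-directory files produces no line at all, which a more complete version would need to handle.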


Anyway,... feel free to close the issue.


Cheers,
Chris.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Fri, 03 Mar 2023 00:25:02 GMT) Full text and rfc822 format available.

Message #20 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which
 file types are counted
Date: Fri, 03 Mar 2023 01:24:48 +0100
Oh, and I forgot to mention another main drawback of such a script.

It cannot (easily) be used with du's other options, cause that would
require some options parser to be added to the script.

While this is of course rather easily possible (getopt) the main
problem there is IMO to keep it up2date with any option changes to du.


Cheers,
Chris.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Sat, 04 Mar 2023 22:59:02 GMT) Full text and rfc822 format available.

Message #23 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Christoph Anton Mitterer <calestyo <at> scientia.org>
Cc: 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Sat, 4 Mar 2023 14:58:25 -0800
What's the motivation here? Does this have something to do with 
reproducible builds?

One possibility is for --apparent-size to always count 0 for 
directories, since 'read' never returns a positive number on 
directories. That is, we reinterpret --apparent-size to mean "bytes that 
could be read" rather than "what st_size says".




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Sat, 04 Mar 2023 23:34:02 GMT) Full text and rfc822 format available.

Message #26 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which
 file types are counted
Date: Sun, 05 Mar 2023 00:33:00 +0100
On Sat, 2023-03-04 at 14:58 -0800, Paul Eggert wrote:
> What's the motivation here? Does this have something to do with 
> reproducible builds?

No, nothing with reproducibility - at least not from my side. It's
really just to get a number for the "actual" data. And yes it's clear
that one can argue what that actually is ;-) ... but at least I think
it should give the same totals for the same files (of any type) on any
filesystem.


> One possibility is for --apparent-size to always count 0 for 
> directories, since 'read' never returns a positive number on 
> directories. That is, we reinterpret --apparent-size to mean "bytes that
> could be read" rather than "what st_size says".

That sounds like it has good potential for breaking existing stuff.

And in a way it solves the fundamental problem only partially:

As said above, it's not even clear what "actual" or "pristine" data
should actually be.

I would say that it's at least independent of any underlying structures
(like meta data of a filesystem or e.g. header data in a tar archive).

But would symlinks (i.e. their length) count for it?
What about hardlinked files, would they count once or n times?


du already allows to select what it should do for hard links (-l) so I
figured it would fit conceptually if it would allow the same for file
types.
E.g. with a --type option that takes a string of one or more letters, like
find:
              b      block (buffered) special
              c      character (unbuffered) special
              d      directory
              p      named pipe (FIFO)
              f      regular file
              l      symbolic link
              s      socket
              D      door (Solaris)

If --type is given, only files of the listed types are counted (but it has
no effect on whether such files are followed or recursed into, in the
case of d or l).
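
The proposed semantics can be approximated today with find. The following is a rough sketch only: sum_by_type is a hypothetical helper, not an existing du option, and it assumes GNU find for -printf. It still traverses every directory, so the type letters only select what is counted, not what is recursed into.

```shell
# Hypothetical helper emulating the proposed --type filter.
# Usage: sum_by_type TYPES PATH...   (e.g. sum_by_type fl dir1)
sum_by_type() {
    types=$1
    shift
    expr_args=
    # Turn the letters in TYPES into "-type X -o -type Y ...".
    while [ -n "$types" ]; do
        letter=${types%"${types#?}"}   # first character of $types
        types=${types#?}               # remaining characters
        expr_args="$expr_args${expr_args:+ -o }-type $letter"
    done
    # $expr_args is deliberately unquoted so it splits into find primaries.
    find "$@" \( $expr_args \) -printf '%s\n' |
    awk '{ t += $1 } END { printf "%.0f\n", t }'
}
```

For example, sum_by_type f dir1 would sum only regular files below dir1, while sum_by_type fl dir1 would also add the lengths of symbolic links.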


But anyway... as said previously... I already have my script that does
more or less what I want.
So if you think the whole idea is overkill for du, then don't hesitate
to close as wontfix.


Cheers,
Chris.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 05 Mar 2023 01:01:02 GMT) Full text and rfc822 format available.

Notification sent to Christoph Anton Mitterer <calestyo <at> scientia.org>:
bug acknowledged by developer. (Sun, 05 Mar 2023 01:01:02 GMT) Full text and rfc822 format available.

Message #31 received at 61884-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Christoph Anton Mitterer <calestyo <at> scientia.org>
Cc: 61884-done <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Sat, 4 Mar 2023 17:00:25 -0800
[Message part 1 (text/plain, inline)]
On 2023-03-04 15:33, Christoph Anton Mitterer wrote:

> But would symlinks (i.e. their length) count for it?

Sure, because you can read symlinks by using readlink, and that gives 
you their lengths.

Come to think of it, POSIX specifies st_size only for regular files and 
symlinks among the files you'll find in a directory. So du --apparent 
should count st_size only for these file types; it should ignore st_size 
for other file types unless we know somehow that those sizes make sense 
(which for directories is problematic for the reasons you mention).
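
As an illustration of the point about st_size, the values can be inspected directly. This sketch assumes GNU stat, which does not follow symlinks unless -L is given; show_sizes is a hypothetical name. For a symlink the reported size is the length of its target string, whereas the size reported for a directory depends on the filesystem.

```shell
# Hypothetical helper: print file type, st_size and name for each argument.
# Assumes GNU stat; LC_ALL=C keeps the type names untranslated.
show_sizes() {
    # %F = file type, %s = st_size in bytes, %n = file name.
    LC_ALL=C stat -c '%F %s %n' -- "$@"
}
```

Running show_sizes on a regular file, a symlink and a directory on two different filesystems should show matching sizes for the first two, while the directory size may differ.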


> What about hardlinked files, would they count once or n times?

That's an independent axis and is handled by -l. Hard links are not a 
file type.


>                b      block (buffered) special
>                c      character (unbuffered) special
>                d      directory
>                p      named pipe (FIFO)
>                f      regular file
>                l      symbolic link
>                s      socket
>                D      door (Solaris)

I expect Coreutils's already-existing usable_st_size function should tell us 
which types have usable st_size. This will exclude directories, which 
should be the right thing for your use case.


So I installed the attached patch to fix du --apparent to count sizes 
only when st_size is well-defined. This should address your use case so 
I'm boldly closing the bug report.
[0001-du-apparent-counts-only-symlinks-and-regular.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Sun, 05 Mar 2023 01:22:02 GMT) Full text and rfc822 format available.

Message #34 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Christoph Anton Mitterer <calestyo <at> scientia.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which
 file types are counted
Date: Sun, 05 Mar 2023 02:20:52 +0100
Hey Paul.


On Sat, 2023-03-04 at 17:00 -0800, Paul Eggert wrote:
> 
> So I installed the attached patch

AFAICS this is now only documented in the info page?

Would you mind adding a shorter notice to the manpage as well?


Thanks,
Chris.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Sun, 05 Mar 2023 02:14:02 GMT) Full text and rfc822 format available.

Message #37 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Christoph Anton Mitterer <calestyo <at> scientia.org>
Cc: 61884 <at> debbugs.gnu.org
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Sat, 4 Mar 2023 18:13:25 -0800
On 2023-03-04 17:20, Christoph Anton Mitterer wrote:
> Would you mind adding a shorter notice to the manpage as well?

The manpage is terse by design, and I doubt whether this minor detail 
makes the cut.




Information forwarded to bug-coreutils <at> gnu.org:
bug#61884; Package coreutils. (Mon, 13 Mar 2023 15:27:02 GMT) Full text and rfc822 format available.

Message #40 received at 61884 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: 61884 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, calestyo <at> scientia.org
Subject: Re: bug#61884: add an option to du that allows to control which file
 types are counted
Date: Mon, 13 Mar 2023 15:26:14 +0000
[Message part 1 (text/plain, inline)]
On 05/03/2023 01:00, Paul Eggert wrote:
> On 2023-03-04 15:33, Christoph Anton Mitterer wrote:
> 
>> But would symlinks (i.e. their length) count for it?
> 
> Sure, because you can read symlinks by using readlink, and that gives
> you their lengths.
> 
> Come to think of it, POSIX specifies st_size only for regular files and
> symlinks among the files you'll find in a directory. So du --apparent
> should count st_size only for these file types; it should ignore st_size
> for other file types unless we know somehow that those sizes make sense
> (which for directories is problematic for the reasons you mention).
> 
> 
>> What about hardlinked files, would they count once or n times?
> 
> That's an independent axis and is handled by -l. Hard links are not a
> file type.
> 
> 
>>                 b      block (buffered) special
>>                 c      character (unbuffered) special
>>                 d      directory
>>                 p      named pipe (FIFO)
>>                 f      regular file
>>                 l      symbolic link
>>                 s      socket
>>                 D      door (Solaris)
> 
> I expect Coreutils's already-existing usable_st_size function should tell us
> which types have usable st_size. This will exclude directories, which
> should be the right thing for your use case.
> 
> 
> So I installed the attached patch to fix du --apparent to count sizes
> only when st_size is well-defined. This should address your use case so
> I'm boldly closing the bug report.

The attached patch adjusts the du/threshold test to pass
by no longer testing --apparent with directories.

cheers,
Pádraig
[du--app-dir-test.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 11 Apr 2023 11:24:12 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 15 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.