GNU bug report logs - #10281
du: hard-links counting with multiple arguments (commit


Package: coreutils

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Mon, 12 Dec 2011 18:02:02 UTC

Severity: wishlist

Tags: wontfix

Merged with 10282, 11526

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 10281 in the body.
You can then email your comments to 10281 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 12 Dec 2011 18:02:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 12 Dec 2011 18:02:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Kamil Dudka <kdudka <at> redhat.com>
Cc: bug-coreutils <at> gnu.org
Subject: Re: change in behavior of du with multiple arguments (commit efe53cc)
Date: Mon, 12 Dec 2011 10:00:19 -0800
On 12/12/11 04:50, Kamil Dudka wrote:
> Was such a change in behavior intended?  I am asking as I was not able to
> find it documented anywhere.

It was intended, as it provides useful functionality that
can't be done if hard links aren't tracked across arguments,
whereas the reverse isn't true.  It's documented in
<http://www.gnu.org/software/coreutils/manual/coreutils.html#du-invocation>,
which says:

  If two or more hard links point to the same file,
  only one of the hard links is counted. The file
  argument order affects which links are counted,
  and changing the argument order may change the
  numbers that du outputs.

Perhaps this isn't sufficiently clear, and if so,
suggestions for improvements are welcome.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 12 Dec 2011 18:12:02 GMT) Full text and rfc822 format available.

Message #8 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Wade Stebbings <wade <at> min.ascend.com>, Kamil Dudka <kdudka <at> redhat.com>,
	10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 12 Dec 2011 19:09:56 +0100
Paul Eggert wrote:
> On 12/12/11 04:50, Kamil Dudka wrote:
>> Was such a change in behavior intended?  I am asking as I was not able to
>> find it documented anywhere.
>
> It was intended, as it provides useful functionality that
> can't be done if hard links aren't tracked across arguments,
> whereas the reverse isn't true.  It's documented in
> <http://www.gnu.org/software/coreutils/manual/coreutils.html#du-invocation>,
> which says:
>
>   If two or more hard links point to the same file,
>   only one of the hard links is counted. The file
>   argument order affects which links are counted,
>   and changing the argument order may change the
>   numbers that du outputs.
>
> Perhaps this isn't sufficiently clear, and if so,
> suggestions for improvements are welcome.

FYI, Kamil's original mail seems never to have reached the mailing list[*],
in spite of reaching debbugs and acquiring a bug number and then going
on to reach Paul (the Cc'd recipient).

Kamil or Paul, would you please post the original, for the record?

Jim

[*] Ward Vandewege confirmed that the message reached debbugs but
somehow was not passed on to eggs.gnu.org.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 12 Dec 2011 18:32:01 GMT) Full text and rfc822 format available.

Message #11 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Kamil Dudka <kdudka <at> redhat.com>, 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 12 Dec 2011 10:30:26 -0800
On 12/12/11 10:09, Jim Meyering wrote:

> Kamil or Paul, would you please post the original, for the record?

Sure, here's a copy:

From: Kamil Dudka <kdudka <at> redhat.com>
To: bug-coreutils <at> gnu.org
Subject: change in behavior of du with multiple arguments (commit efe53cc)
Date: Mon, 12 Dec 2011 13:50:30 +0100
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Message-Id: <201112121350.30539.kdudka <at> redhat.com>

Hi,

the following upstream commit introduces a major change in behavior of du
when multiple arguments are specified:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=efe53cc

... and the issue has landed as a bug in our Bugzilla:

https://bugzilla.redhat.com/747075#c3

Was such a change in behavior intended?  I am asking as I was not able to
find it documented anywhere.  The up-to-date man page states:

    Summarize disk usage of each FILE, recursively for directories.

..., where FILE refers to a single argument given to du.  The info 
documentation states:

    The FILE argument order affects which links are counted, and changing the
    argument order may change the numbers that `du' outputs.

However, changing the numbers is one thing and missing lines in the output
of du is quite another thing.

Could anybody please clarify the current behavior of du?  Thanks in advance!

Kamil




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 12 Dec 2011 20:32:01 GMT) Full text and rfc822 format available.

Message #14 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 10281 <at> debbugs.gnu.org, Wade Stebbings <wade <at> min.ascend.com>,
	Kamil Dudka <kdudka <at> redhat.com>
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 12 Dec 2011 13:30:30 -0700
Jim Meyering wrote:
> FYI, Kamil's original mail seems never to have reached the mailing list[*],

It was sitting in the debbugs-submit queue waiting for a human.  I
reviewed the queues a few minutes ago and sent it through.  At least I
am pretty sure it was the same message I saw there.  I hadn't realized
it was something to note until after I read the thread here just now.
Here is the interesting part of the trail.

  Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
          by debbugs.gnu.org with esmtp (Exim 4.69)
          (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
          id 1RaC73-0002OB-Uh
          for submit <at> debbugs.gnu.org; Mon, 12 Dec 2011 15:04:40 -0500
  Received: from eggs.gnu.org ([140.186.70.92])
          by debbugs.gnu.org with esmtp (Exim 4.69)
          (envelope-from <kdudka <at> redhat.com>) id 1Ra5Mj-0000gC-LJ
          for submit <at> debbugs.gnu.org; Mon, 12 Dec 2011 07:52:23 -0500

> in spite of reaching debbugs and acquiring a bug number

That does seem strange since I didn't think it got a bug number until
after it went through debbugs.  It had a bug number and so it must
have already gone through debbugs.  It must work differently from that
somehow.

> and then going on to reach Paul (the Cc'd recipient).

Of course the CC would be a direct message outside of any of the bug
tracking and mailing lists.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 12 Dec 2011 20:35:02 GMT) Full text and rfc822 format available.

Message #17 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Bob Proulx <bob <at> proulx.com>
Cc: Kamil Dudka <kdudka <at> redhat.com>, 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 12 Dec 2011 21:33:15 +0100
Bob Proulx wrote:
> Jim Meyering wrote:
>> FYI, Kamil's original mail seems never to have reached the mailing list[*],
>
> It was sitting in the debbugs-submit queue waiting for a human.  I
> reviewed the queues a few minutes ago and sent it through.  At least I
> am pretty sure it was the same message I saw there.  I hadn't realized
> it was something to note until after I read the thread here just now.
> Here is the interesting part of the trail.
>
>   Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
>           by debbugs.gnu.org with esmtp (Exim 4.69)
>           (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
>           id 1RaC73-0002OB-Uh
>           for submit <at> debbugs.gnu.org; Mon, 12 Dec 2011 15:04:40 -0500
>   Received: from eggs.gnu.org ([140.186.70.92])
>           by debbugs.gnu.org with esmtp (Exim 4.69)
>           (envelope-from <kdudka <at> redhat.com>) id 1Ra5Mj-0000gC-LJ
>           for submit <at> debbugs.gnu.org; Mon, 12 Dec 2011 07:52:23 -0500
>
>> in spite of reaching debbugs and acquiring a bug number
>
> That does seem strange since I didn't think it got a bug number until
> after it went through debbugs.  It had a bug number and so it must
> have already gone through debbugs.  It must work differently from that
> somehow.
>
>> and then going on to reach Paul (the Cc'd recipient).
>
> Of course the CC would be a direct message outside of any of the bug
> tracking and mailing lists.

Thanks, Bob.
I forgot about the debbugs queue.
I checked only the bug-coreutils mailman admin queue.




Forcibly Merged 10281 10282. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 13 Dec 2011 00:27:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Wed, 14 Dec 2011 09:05:02 GMT) Full text and rfc822 format available.

Message #22 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <elliott.forney <at> gmail.com>
To: 10281 <at> debbugs.gnu.org
Subject: change in behavior of du with multiple arguments (commit efe53cc)
Date: Tue, 13 Dec 2011 13:25:10 -0700
I think everyone is missing a subtle point here.  If I run "du -s a
a/b" then there are in fact NOT two hard links to the same file (Linux
doesn't even allow hard linked directories).  Rather, there are two
command line arguments pointing to two directories that are not
mutually exclusive.  This is a subtle but important difference and I
think POSIX is being misinterpreted/misused here.

Personally, I find the new behavior to be counterintuitive and I have
talked to several confused users about this change.  I have also seen
at least one script break.

Also, as Eric noted, the du implementations in Solaris, OSX and AIX
(all of which are POSIX compliant) give the same output for "du -s a
a/b" and "du -s a; du -s a/b"




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Wed, 14 Dec 2011 19:05:01 GMT) Full text and rfc822 format available.

Message #25 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Elliott Forney <elliott.forney <at> gmail.com>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Wed, 14 Dec 2011 11:03:03 -0800
On 12/13/11 12:25, Elliott Forney wrote:
> If I run "du -s a
> a/b" then there are in fact NOT two hard links to the same file (Linux
> doesn't even allow hard linked directories).

The intent of the POSIX spec is that files should be counted only once,
regardless of whether they are arrived at via hard links, or by following
symbolic links with -L, or by any other means.  One does not need to
appeal to hard-linked directories to run into the issue.  For example:

  $ mkdir d
  $ echo foo >f
  $ ln f d/f
  $ du d f
  2 d
  1 f

It's hard to argue that POSIX allows this behavior, even though it's what
Solaris 10 du does.

> the du implementations in Solaris, OSX and AIX
> (all of which are POSIX compliant)

No, I don't think they conform to POSIX in the above example.
Perhaps this is a bug in POSIX, of course, but there is a
good argument for why GNU du behaves the way it does: you get
useful behavior that you cannot get easily with the Solaris
du behavior.
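
The d/f example above can be reproduced with a small model. The following
is only a sketch, not how du is implemented: the helper name
regular_file_bytes is hypothetical, it sums st_size of regular files
instead of disk blocks so that the numbers do not depend on the file
system, and it assumes a POSIX system where hard links can be created in
the temporary directory.

```python
import os
import stat
import tempfile


def regular_file_bytes(args, across_args):
    """Return one byte total per argument, counting each inode once.

    across_args=True remembers inodes seen under earlier arguments;
    across_args=False forgets them at the start of every argument.
    """
    seen = set()
    totals = []
    for arg in args:
        if not across_args:
            seen = set()  # traditional behavior: forget earlier arguments
        total = 0
        # os.walk yields nothing for a non-directory, so list the argument too.
        paths = [arg]
        for root, _dirs, files in os.walk(arg):
            paths.extend(os.path.join(root, name) for name in files)
        for path in paths:
            st = os.lstat(path)
            key = (st.st_dev, st.st_ino)
            if stat.S_ISREG(st.st_mode) and key not in seen:
                seen.add(key)
                total += st.st_size
        totals.append(total)
    return totals


with tempfile.TemporaryDirectory() as top:
    d = os.path.join(top, "d")
    f = os.path.join(top, "f")
    os.mkdir(d)
    with open(f, "w") as fh:
        fh.write("foo\n")
    os.link(f, os.path.join(d, "f"))  # same layout as the d/f example above

    across = regular_file_bytes([d, f], across_args=True)
    per_arg = regular_file_bytes([d, f], across_args=False)

print("across arguments:", across)
print("per argument:   ", per_arg)
```

With inodes remembered across arguments, the second argument f adds
nothing because its inode was already counted under d; with per-argument
tracking the same four bytes are reported for both arguments.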




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Wed, 14 Dec 2011 21:56:01 GMT) Full text and rfc822 format available.

Message #28 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
To: 10281 <at> debbugs.gnu.org
Cc: Elliott Forney <elliott.forney <at> gmail.com>
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Wed, 14 Dec 2011 16:54:19 -0500 (GMT+5)
Paul Eggert writes:
> 
> Perhaps this is a bug in POSIX, of course, but there is a
> good argument for why GNU du behaves the way it does: you get
> useful behavior that you cannot get easily with the Solaris
> du behavior.
> 

Remind us again... the "useful behavior" is that du -s returns a column of
numbers next to a column of names, and the numbers don't necessarily have any
individual meaning relevant to the adjacent names, but you can add them up
manually and get something that is the correct total for the group.

Meanwhile if you wanted the total for the group you would have used -c and
not had to add them up manually.

Why not let the -c total be correct *and* the -s individual numbers also be
correct for the names they are next to? Like this:

$ mkdir a b ; echo hello > a/a ; ln a/a b/b ; du -cs a b
8       a
8       b
12      total

The fact that the numbers on the left don't add up means there is less
redundancy in the output. Each number actually tells me something you can't
derive from the others. There is higher information content. This is good,
not bad.

-- 
Alan Curry
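
The arithmetic behind the proposed "8 / 8 / 12" output can be sketched as
follows. The block counts are assumptions chosen to match the example
(4 blocks for each directory itself and 4 for the one shared file); the
object names are purely illustrative.

```python
# Assumed block counts, chosen to match the example output above.
usage = {
    "a": {"dir-a": 4, "shared-file": 4},
    "b": {"dir-b": 4, "shared-file": 4},
}

# Per-argument numbers: each argument is measured on its own.
per_argument = {name: sum(objs.values()) for name, objs in usage.items()}

# Grand total: an object reachable from several arguments is kept once.
distinct = {}
for objs in usage.values():
    distinct.update(objs)
total = sum(distinct.values())

for name, blocks in per_argument.items():
    print(f"{blocks}\t{name}")
print(f"{total}\ttotal")
```

The per-argument lines sum to 16 while the total is 12; the difference of
4 is exactly the space shared between a and b.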




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Wed, 14 Dec 2011 22:16:02 GMT) Full text and rfc822 format available.

Message #31 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Wed, 14 Dec 2011 15:13:32 -0700
Alan Curry wrote:
> Why not let the -c total be correct *and* the -s individual numbers also be
> correct for the names they are next to? Like this:
> 
> $ mkdir a b ; echo hello > a/a ; ln a/a b/b ; du -cs a b
> 8       a
> 8       b
> 12      total
> 
> The fact that the numbers on the left don't add up means there is less
> redundancy in the output. Each number actually tells me something you can't
> derive from the others. There is higher information content. This is good,
> not bad.

I like this idea.  +1

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Thu, 15 Dec 2011 01:49:02 GMT) Full text and rfc822 format available.

Message #34 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alan Curry <pacman-cu <at> kosh.dhis.org>
Cc: 10281 <at> debbugs.gnu.org, Elliott Forney <elliott.forney <at> gmail.com>
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Wed, 14 Dec 2011 17:46:29 -0800
On 12/14/11 13:54, Alan Curry wrote:
> Why not let the -c total be correct *and* the -s individual numbers also be
> correct for the names they are next to?

Well, for starters, because the individual numbers
*are* correct for the names they are next to, for a
reasonable definition of "correct".  If the working
directory has two subdirectories A and B in that order,
it's counterintuitive that "du ." should output different
numbers for B than "du A B" does.  It's more intuitive
if the two invocations of du output numbers that agree.
But the Solaris du semantics require that the
numbers must disagree if A and B share space.

The use case you gave works with two arguments, but
it loses useful information with three or more arguments.
For example, suppose I have a bunch of hard links
that all reside in three directories A, B, and C,
and I want to find out how much disk space
I'll reclaim by removing C.  (This is a common situation
with git clones, for example.)  With GNU du, I can run
"du -s A B C" and the output line labeled "C" will tell
me how much disk space I'll reclaim.  There's no easy way
to do this with Solaris du.

In contrast, GNU du can easily support the use case you
mentioned, even with more than two arguments:

  du -s A
  du -s B
  du -s C
  du -cs A B C | tail -1

so in this sense its semantics are more powerful than
those of Solaris du.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Thu, 15 Dec 2011 08:33:02 GMT) Full text and rfc822 format available.

Message #37 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: Bob Proulx <bob <at> proulx.com>, "10281 <at> debbugs.gnu.org"
	<10281 <at> debbugs.gnu.org>
Subject: RE: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Thu, 15 Dec 2011 09:30:36 +0100
Bob Proulx wrote:
> Alan Curry wrote:
> > Why not let the -c total be correct *and* the -s individual numbers also be
> > correct for the names they are next to? Like this:
> > 
> > $ mkdir a b ; echo hello > a/a ; ln a/a b/b ; du -cs a b
> > 8       a
> > 8       b
> > 12      total
> > 
> > The fact that the numbers on the left don't add up means there is less
> > redundancy in the output. Each number actually tells me something you can't
> > derive from the others. There is higher information content. This is good,
> > not bad.
> 
> I like this idea.  +1

I also like the idea, but is there already consensus on the starting
question: what to do with - I call it - stacked arguments?

$ mkdir -p d/d
$ echo foo > d/f
$ echo bar > d/d/g
$ ln d/f d/d/f
$ du -s d d/d

I omitted the result, but I think Paul's question
is the important one:
  "how much disk space I'll reclaim by removing ..."
Therefore, and from a user's perspective, I'd expect
du to count d/d again for the sum of the first argument d.

Is this covered by the proposal?
What would be the output for the above example?

Have a nice day,
Berny



Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Fri, 16 Dec 2011 23:35:01 GMT) Full text and rfc822 format available.

Message #40 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <idfah <at> cs.colostate.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Fri, 16 Dec 2011 13:46:47 -0700
> The intent of the POSIX spec is that files should be counted only once,
> regardless of whether they are arrived at via hard links, or by following
> symbolic links with -L, or by any other means.

I agree that symlinks and hard links and maybe even bind mounts or
whatever else should not be counted twice.  I do think, however, that
multiple command line arguments should be counted individually since
they were explicitly specified by the user.  At least by default.

My proposed solution would be the following:

By default, files with the same inode should only be counted once for
each command line argument.  This can already be overridden with
--count-links.  However, everything should be reset between command
line arguments so that multiple command line arguments are counted
individually.  As pointed out by Alan Curry, -c should be used to get
a correct total.

In addition to this, it would be nice if there were a command line
switch that allowed for files to only be counted once across all
command line arguments, i.e. a switch to enable the current behavior.

In this scheme, GNU du retains compatibility with Solaris, AIX,
OSX, et cetera, but the user has the option to count multiple command
line arguments only once.  I do agree that this functionality can be
useful in some circumstances; I just think it is a confusing default.

> No, I don't think they conform to POSIX in the above example.
> Perhaps this is a bug in POSIX

Of course, maybe the first thing to do is to get clarification on the
POSIX spec.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 00:38:02 GMT) Full text and rfc822 format available.

Message #43 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Elliott Forney <idfah <at> cs.colostate.edu>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Fri, 16 Dec 2011 17:36:09 -0700
On 12/16/2011 01:46 PM, Elliott Forney wrote:
> In this scheme, GNU du retains compatibility with Solaris, AIX,
> OSX, et cetera, but the user has the option to count multiple command
> line arguments only once.  I do agree that this functionality can be
> useful in some circumstances; I just think it is a confusing default.
> 
>> No, I don't think they conform to POSIX in the above example.
>> Perhaps this is a bug in POSIX
> 
> Of course, maybe the first thing to do is to get clarification on the
> POSIX spec.

OK, I'll take on that task, and post a link once I've submitted the bug
report.

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 02:39:01 GMT) Full text and rfc822 format available.

Message #46 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
To: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Fri, 16 Dec 2011 21:36:41 -0500 (GMT+5)
Paul Eggert writes:
> 
> For example, suppose I have a bunch of hard links
> that all reside in three directories A, B, and C,
> and I want to find out how much disk space
> I'll reclaim by removing C.  (This is a common situation
> with git clones, for example.)  With GNU du, I can run
> "du -s A B C" and the output line labeled "C" will tell
> me how much disk space I'll reclaim.  There's no easy way
> to do this with Solaris du.

The straightforward method would be to simply the directory you intend to
remove and keep track of the discrepancy between st_nlink and how many links
you've seen.

I admit that this straightforward method isn't implemented in any standard
tool, but your way involves extra work by both du, which must traverse all
the other directories which might share files with the target directory; and
the user, who must somehow amass that list of directories ahead of time. As a
creative improvised use of pre-existing tools it's a good example, but as a
justification for an intentional feature, it's just too inefficient.

-- 
Alan Curry




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 04:20:02 GMT) Full text and rfc822 format available.

Message #49 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
Cc: Elliott Forney <idfah <at> cs.colostate.edu>, Paul Eggert <eggert <at> cs.ucla.edu>,
	10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Fri, 16 Dec 2011 21:18:11 -0700
On 12/16/2011 05:36 PM, Eric Blake wrote:
>>> Perhaps this is a bug in POSIX
>>
>> Of course, maybe the first thing to do is to get clarification on the
>> POSIX spec.
> 
> OK, I'll take on that task, and post a link once I've submitted the bug
> report.

Submitted, and I'll follow up later once this has been discussed in an
Austin Group meeting: http://austingroupbugs.net/view.php?id=527

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 05:12:01 GMT) Full text and rfc822 format available.

Message #52 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Don Cragun <dcragun <at> sonic.net>
Cc: 10281 <at> debbugs.gnu.org, austin-group-l <austin-group-l <at> opengroup.org>
Subject: Re: [1003.1(2008)/Issue 7 0000527]: du and files found via multiple
	command line arguments
Date: Fri, 16 Dec 2011 22:09:17 -0700
[cc'ing the coreutils bug id]

On 12/16/2011 09:33 PM, Don Cragun wrote:
> On Dec 16, 2011, at 8:15 PM, Austin Group Bug Tracker wrote:
>> The following issue has been SUBMITTED. 
>> ====================================================================== 
>> http://austingroupbugs.net/view.php?id=527 
>  ... ... ...
>> In all likelihood, the intent of the standard was to codify traditional
>> behavior where the hash for duplicate files is reset each time du
>> starts processing the next command line argument, and GNU du was
>> wrong for trying to take the standard too literally.  However, it was
>> pointed out that the GNU behavior of remembering duplicates across
>> multiple command line arguments does have a use not possible in the
>> traditional implementation: if a user has multiple directories, all
>> of which share some hard links, then only the GNU semantics make it
>> possible to see how much disk space will be reclaimed by removing
>> the one directory, by invoking 'du -s' with the directory to be
>> removed as the last argument.  Therefore, I'm presenting two options
>> for solving the conflict in the standard, although my preference
>> would be for option 1 (the GNU implementation is willing to change
>> its behavior to comply with option 1 by adding an extension option
>> to provide its current behavior of remembering links across
>> multiple command line arguments, and all other implementations
>> already comply with option 1).
> 
> If we go with option 1, what option would the GNU implementation
> use to specify the current GNU behavior?  Should option 1 be
> extended to include the new option?

I assume that GNU would initially prefer to go with extensions only
through a long option, so as not to prematurely burn a short option on
an unpopular feature and so as not to risk collisions with short option
extensions chosen by other implementations.  One idea raised in the
initial bug report against GNU was to convert the existing 'du
--count-links' long option to take an optional argument, as in:

du --count-links         => shorthand for du --count-links=always
du --count-links=always  => no elision of multiples
du --count-links=once    => current default GNU behavior, elide links
                            across args
du --count-links=per-arg => traditional behavior, elide links, but only
                            within each argument

Right now, GNU has 'du -l' as a synonym for 'du --count-links', but if
--count-links gains an optional qualifier, -l would NOT have an optional
qualifier, but would only match --count-links=always.  Obviously, POSIX
won't standardize a long option.  But if POSIX deems the
--count-links=once behavior useful enough to standardize a new short
option for it, even if GNU is the only implementation currently
implementing that behavior, we have no problem assigning that short
option as a synonym for the proposed --count-links=once behavior.  Would
'-o' conflict with any existing implementation, as a mnemonic for 'once
across all arguments'?  Or would some other short option letter be a
better mnemonic?

I'm not sure whether coreutils will immediately switch to having 'du'
without --count-links always default to the POSIX behavior, or whether,
in GNU fashion, the existence of $POSIXLY_CORRECT in the environment
will affect the choice of defaulting between the compliant vs. the GNU
behavior, but that's an implementation choice for GNU that should not
affect the decision on the resolution of the POSIX bug.
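
The three proposed --count-links modes can be compared with a small
model. This is a sketch under stated assumptions, not du's code: the
function du_model and the sample "link farm" data are hypothetical, and
each argument is represented simply as the list of (inode, blocks) pairs
for the hard links found beneath it.

```python
def du_model(args, mode):
    """Model summary sizes under the proposed --count-links modes.

    args: list of (name, links); links is a list of (inode, blocks),
          one entry per hard link found under that argument.
    mode: "always", "once" or "per-arg" (names from the proposal above).
    """
    if mode not in ("always", "once", "per-arg"):
        raise ValueError(mode)
    seen = set()
    result = []
    for name, links in args:
        if mode == "per-arg":
            seen = set()  # forget links found under earlier arguments
        total = 0
        for inode, blocks in links:
            if mode != "always":
                if inode in seen:
                    continue  # already counted: elide this link
                seen.add(inode)
            total += blocks
        result.append((name, total))
    return result


# Hypothetical link farm: inodes 1 and 2 are shared by A, B and C,
# inode 1 has two links inside A, and inode 3 exists only in C.
farm = [
    ("A", [(1, 8), (1, 8), (2, 4)]),
    ("B", [(1, 8), (2, 4)]),
    ("C", [(1, 8), (2, 4), (3, 16)]),
]

for mode in ("always", "once", "per-arg"):
    print(mode, du_model(farm, mode))
```

In the "once" mode the line for C is 16, the space held only by C. That
is the "how much would removing the last argument reclaim" reading
described earlier in the thread, and it holds only if A and B together
contain every other link to the shared files.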

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 07:44:01 GMT) Full text and rfc822 format available.

Message #55 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alan Curry <pacman-cu <at> kosh.dhis.org>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Fri, 16 Dec 2011 23:41:20 -0800
On 12/16/11 18:36, Alan Curry wrote:
> The straightforward method would be to simply the directory you intend to
> remove and keep track of the discrepancy between st_nlink and how many links
> you've seen.

Sorry, I can't parse that.  But whatever it is, it sounds like you're
talking about what one could do with a program written in C, not with
either GNU or Solaris du.

> As a creative improvised use of pre-existing tools it's a good example, but as a
> justification for an intentional feature, it's just too inefficient.

I'm having trouble parsing that as well but will try to answer anyway. :-)

First, the use of 'du' in the way I'm describing
is not particularly creative or improvised.  I use it
often in link farms (i.e., directories containing many
multiply-linked files).  And it's no accident that Git encourages link
farms either: the Git maintainer is a former coworker of mine,
and even before Git existed we used link farms a lot during software
development, and needed tools like 'du' to work well in link farms,
and this is partly why GNU 'du' works the way it does.  In short,
what may have appeared to you to be an accidental use of 'du'
is actually a designed one.

Second, I don't see what efficiency has to do with this, because exactly
the same efficiency issue arises with Solaris du, when it is given
a different argument list.  With Solaris du, I can get essentially
the same output as GNU "du A B C" by temporarily modifying the file
system, as follows:

 $ mkdir tmp
 $ mv A B C tmp
 $ (cd tmp; du; mv A B C ..)
 $ rmdir tmp

Of course I'd never want to do that in an actual link farm: it's tricky
and brittle and could mess up currently-running builds.  But the point is that
GNU du is not being inefficient here, any more than Solaris du is.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 08:42:01 GMT) Full text and rfc822 format available.

Message #58 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
To: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Sat, 17 Dec 2011 03:39:55 -0500 (GMT+5)
Paul Eggert writes:
> 
> On 12/16/11 18:36, Alan Curry wrote:
> > The straightforward method would be to simply the directory you intend to
> > remove and keep track of the discrepancy between st_nlink and how many links
> > you've seen.
> 
> Sorry, I can't parse that.  But whatever it is, it sounds like you're
> talking about what one could do with a program written in C, not with
> either GNU or Solaris du.

Yes, I'm saying that du is just not the tool for this job, although you've
managed to twist it to fit.

The "predict free space after rm -rf foo" operation can be done without
searching other directories and without requiring the user to specify a list
of other directories that might contain links. What you do with du is kludgy
by comparison.

[...]
> Of course I'd never want to do that in an actual link farm: it's tricky
> and brittle and could mess up currently-running builds.  But the point is that
> GNU du is not being inefficient here, any more than Solaris du is.
> 

By comparison to a proper tool which doesn't do any unnecessary traversals of
extra directories, your use of du is slow and brittle (if the user forgets
an alternate directory containing a link, the result is wrong) and has only
the slight advantage of already being implemented.

Here's a working outline of the single-traversal method. I wouldn't suggest
that du should contain equivalent code. A single-purpose perl script, even
without pretty output formatting, feels clean enough to me. Since I've gone
to the trouble (not much) of writing it, I'll keep it as ~/bin/predict_rm_rf
for future use.

#!/usr/bin/perl -W
use strict;
use File::Find;

@ARGV or die "Usage: $0 directory [directory ...]\n";

my $total = 0;
my %pending = ();

File::Find::find({wanted => sub {
  my ($dev,$ino,$nlink,$blocks) = (lstat($_))[0,1,3,12];
  if(-d _ || $nlink==1) {
    $total += $blocks;
    return;
  }
  if($nlink == ++$pending{"$dev.$ino"}) {
    delete $pending{"$dev.$ino"};
    $total += $blocks;
  }
}}, @ARGV);

print "$total blocks would be freed by rm -rf @ARGV\n";
__END__

-- 
Alan Curry




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 17 Dec 2011 09:22:01 GMT) Full text and rfc822 format available.

Message #61 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "Alan Curry" <pacman-cu <at> kosh.dhis.org>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit
Date: Sat, 17 Dec 2011 10:20:09 +0100
Alan Curry wrote:
...
> By comparison to a proper tool which doesn't do any unnecessary traversals of
> extra directories, your use of du is slow and brittle (if the user forgets
> an alternate directory containing a link, the result is wrong) and has only
> the slight advantage of already being implemented.
>
> Here's a working outline of the single-traversal method. I wouldn't suggest
> that du should contain equivalent code. A single-purpose perl script, even
> without pretty output formatting, feels clean enough to me. Since I've gone
> to the trouble (not much) of writing it, I'll keep it as ~/bin/predict_rm_rf
> for future use.
>
> #!/usr/bin/perl -W
> use strict;
> use File::Find;
>
> @ARGV or die "Usage: $0 directory [directory ...]\n";
>
> my $total = 0;
> my %pending = ();
>
> File::Find::find({wanted => sub {
>   my ($dev,$ino,$nlink,$blocks) = (lstat($_))[0,1,3,12];
>   if(-d _ || $nlink==1) {
>     $total += $blocks;
>     return;
>   }
>   if($nlink == ++$pending{"$dev.$ino"}) {
>     delete $pending{"$dev.$ino"};
>     $total += $blocks;
>   }
> }}, @ARGV);
>
> print "$total blocks would be freed by rm -rf @ARGV\n";

That seems useful.
However, the number it prints is too large whenever it processes
a file or directory more than $nlink times, e.g., when invoked as

    predict_rm_rf F F

it prints double the correct number.

To account for that, the script must record every dev/ino pair
it processes, say via:

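    File::Find::find({wanted => sub {
      my ($dev,$ino,$nlink,$blocks) = (lstat($_))[0,1,3,12];
      # A negative entry marks a dev/ino pair that was already counted.
      defined $pending{"$dev.$ino"} && $pending{"$dev.$ino"} < 0
        and return;

      # Count directories and singly linked files right away; count a
      # multiply linked file only once all of its links have been seen.
      if(-d _ || $nlink==1 || $nlink == ++$pending{"$dev.$ino"}) {
        $total += $blocks;
        $pending{"$dev.$ino"} = -1;
        return;
      }
    }}, @ARGV);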

Note that for a large tree, the perl code will be far less efficient
than C code like du because:

  - the perl script must call lstat for every single entry (du can
    use dirent.d_ino on some file systems).  When I checked about a year
    ago, Perl still had no good way to get something like dirent.d_ino.
  - du uses a compact representation for a device/inode pair, so
    may use a lot less memory.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sun, 18 Dec 2011 22:06:02 GMT) Full text and rfc822 format available.

Message #64 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: Don Cragun <dcragun <at> sonic.net>, 10281 <at> debbugs.gnu.org,
	austin-group-l <austin-group-l <at> opengroup.org>
Subject: Re: [1003.1(2008)/Issue 7 0000527]: du and files found via multiple
	command line arguments
Date: Sun, 18 Dec 2011 14:03:49 -0800
Eric Blake's Option 1 does not appear to be tenable, as du
traditionally preserved hashes of duplicate files across all
of its operands.  7th Edition Unix 'du' did that, and (as
Jilles Tjoelker pointed out) so do at least two current 'du'
implementations, namely, FreeBSD and GNU.

The idea behind Eric's Option 2 is better, but its wording
is unclear partly because of another issue Jilles raised:
whether a file's disk space should be counted multiple times
if the file occurs multiple times and its link count is 1.
For example:

  mkdir d
  cd d
  cp /bin/sh w
  cp w y
  ln y ../y
  ln -s w x
  ln -s y z
  du -aL

This analyzes a directory with two regular files, 'w' and
'y'.  GNU and Solaris du count these files once each, with
an accurate sum of non-symlink disk usage under the current
directory.  But w's link count is 1 so FreeBSD counts 'w'
twice, thus overcounting disk usage.

The current POSIX wording does not say what to do for this
example, but the intent is to avoid overcounting disk usage,
and the GNU and Solaris behavior supports this intent better.
(The 7th Edition Unix behavior agrees with FreeBSD, but this
predates symbolic links so the behavior is now dubious.)

Given all the above, the standard's wording could be
improved in several different ways, all elaborations of
Option 2.  Here are two possibilities:

  Option 2A - require that files be hashed among all
  operands, and that disk usage be counted at most once.

    Change line 84170 [du DESCRIPTION] from:

      Files with multiple links shall be counted and written
      for only one entry.

    to:

      A file that occurs multiple times shall be counted and
      written for only one entry, even if the occurrences
      are under different file operands.

  Option 2B - leave unspecified whether files are hashed
  among all operands, and leave unspecified whether disk
  usage is counted multiple times for files whose link
  count does not exceed 1.  From the user's point of view,
  this means du's output is a reliable count of disk usage
  only if du is invoked without -L and with -x and with at
  most one operand.

    Change line 84170 [du DESCRIPTION] from:

      Files with multiple links shall be counted and written
      for only one entry.

    to:

      A file that occurs multiple times under one file
      operand and that has a link count greater than 1 shall
      be counted and written for only one entry.  It is
      implementation-defined whether a file that has a link
      count no greater than 1 is counted and written just
      once, or is counted and written for each occurrence.
      It is implementation-defined whether a file that
      occurs under one file operand is counted for other
      file operands.

Option 2A is simpler and clearer, but it invalidates many
existing implementations.  Option 2B modifies the standard
to describe how existing implementations actually work, but
is more complicated and more of a hassle to use reliably.

Eric raised one other issue: the description of the -a
option implies that "du A B" must always list B.  This
implication is incorrect for 7th edition Unix du, GNU du,
and (I expect) FreeBSD du, so it should be fixed as well.
Here's one possible fix, which is independent of the
abovementioned changes.

  Change line ????? [du OPTIONS] from:

    Regardless of the presence of the -a option,
    non-directories given as file operands shall always
    be listed.

  to:

    The -a option does not affect whether
    non-directories given as file operands are listed.

(Sorry, I don't know the line number here; I don't have a
PDF copy of the current standard and don't know offhand how
to get one.)





Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 19 Dec 2011 09:13:02 GMT) Full text and rfc822 format available.

Message #67 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: Elliott Forney <idfah <at> cs.colostate.edu>, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: "10281 <at> debbugs.gnu.org" <10281 <at> debbugs.gnu.org>
Subject: RE: bug#10281: change in behavior of du with multiple arguments
	(commit	efe53cc)
Date: Mon, 19 Dec 2011 10:10:33 +0100
Elliott Forney wrote:

> > The intent of the POSIX spec is that files should be counted only once,
> > regardless of whether they are arrived at via hard links, or by following
> > symbolic links with -L, or by any other means.
> 
> I agree that symlinks and hard links and maybe even bind mounts or
> whatever else should not be counted twice.  I do think, however, that
> multiple command line arguments should be counted individually since
> they were explicitly specified by the user.  At least by default.
> 
> My proposed solution would be the following:
> 
> By default, files with the same inode should only be counted once for
> each command line argument.  This can already be overridden with
> --count-links.  However, everything should be reset between command
> line arguments so that multiple command line arguments are counted
> individually.  As pointed out by Alan Curry, -c should be used to get
> a correct total.
> 
> In addition to this, it would be nice if there were a command line
> switch that allowed for files to only be counted once across all
> command line arguments, i.e. a switch to enable the current behavior.

+1

The big disadvantage of counting only once for all arguments is that
the result highly depends on the order of the arguments, even in a
simple case without symlinks and hardlinks:

  $ du -s * .
vs.
  $ du -s . *

That reminds me about a real-life question you could ask your little
daughter: "how many pupils are in each class and in total at school?".
I guess you would send her to extra math courses if she said "Class A
has 20, class B and class C have 25 each, and the school has 0."

This example doesn't claim to be 100% relevant for du, but shows
how "counting" and "summarizing" are burnt into human brains.

Have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Thu, 12 Jan 2012 17:51:01 GMT) Full text and rfc822 format available.

Message #70 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Don Cragun <dcragun <at> sonic.net>, 10281 <at> debbugs.gnu.org,
	austin-group-l <austin-group-l <at> opengroup.org>
Subject: Re: [1003.1(2008)/Issue 7 0000527]: du and files found via multiple
	command line arguments
Date: Thu, 12 Jan 2012 10:49:34 -0700
This topic came up again on the Austin Group call today, with no good
resolution yet.

On 12/18/2011 03:03 PM, Paul Eggert wrote:
> Eric Blake's Option 1 does not appear to be tenable, as du
> traditionally preserved hashes of duplicate files across all
> of its operands.  7th Edition Unix 'du' did that, and (as
> Jilles Tjoelker pointed out) so do at least two current 'du'
> implementations, namely, FreeBSD and GNU.
> 
> The idea behind Eric's Option 2 is better, but its wording
> is unclear partly because of another issue Jilles raised:
> whether a file's disk space should be counted multiple times
> if the file occurs multiple times and its link count is 1.
> For example:
> 
>   mkdir d
>   cd d
>   cp /bin/sh w
>   cp w y
>   ln y ../y
>   ln -s w x
>   ln -s y z
>   du -aL
> 
> This analyzes a directory with two regular files, 'w' and
> 'y'.  GNU and Solaris du count these files once each, with
> an accurate sum of non-symlink disk usage under the current
> directory.  But w's link count is 1 so FreeBSD counts 'w'
> twice, thus overcounting disk usage.
> 
> The current POSIX wording does not say what to do for this
> example, but the intent is to avoid overcounting disk usage,
> and the GNU and Solaris behavior supports this intent better.
> (The 7th Edition Unix behavior agrees with FreeBSD, but this
> predates symbolic links so the behavior is now dubious.)

One of the points made is that the standard currently requires elision
only for files with link counts > 1.  An interesting example with
FreeBSD du:

$ echo > a
$ du -a a a
2       a
2       a
$ ln a b
$ du -a a a
2      a
$

That is, the second argument was elided when the inode for 'a' is found
in the hash, which means the hash is preserved across arguments; but the
inode for 'a' is only put in the hash if the link count is > 1.
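
The rule inferred above can be modelled in a few lines of shell. This is
only a toy sketch of the described behavior, not FreeBSD's source: the
name list_counted is hypothetical, device numbers are ignored (all
operands are assumed to be on one file system), and operands are assumed
to be plain files.

```shell
# Toy model: the table persists across operands, but an inode is only
# recorded in it when the file's link count exceeds 1.
list_counted() {
  seen=""                                       # inode numbers recorded so far
  for f in "$@"; do
    ino=$(ls -di -- "$f" | awk '{print $1}')    # inode number of the operand
    nlink=$(ls -ld -- "$f" | awk '{print $2}')  # link count of the operand
    case " $seen " in
      *" $ino "*) continue ;;                   # found in the table: elide it
    esac
    if [ "$nlink" -gt 1 ]; then
      seen="$seen $ino"                         # record multiply linked files only
    fi
    printf '%s\n' "$f"                          # this operand would be listed
  done
}
```

With a singly linked file, "list_counted a a" lists 'a' twice; after
"ln a b" it lists 'a' once, matching the transcript above.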

> 
> Given all the above, the standard's wording could be
> improved in several different ways, all elaborations of
> Option 2.  Here are two possibilities:
> 
>   Option 2A - require that files be hashed among all
>   operands, and that disk usage be counted at most once.
> 
>     Change line 84170 [du DESCRIPTION] from:
> 
>       Files with multiple links shall be counted and written
>       for only one entry.
> 
>     to:
> 
>       A file that occurs multiple times shall be counted and
>       written for only one entry, even if the occurrences
>       are under different file operands.
> 
>   Option 2B - leave unspecified whether files are hashed
>   among all operands, and leave unspecified whether disk
>   usage is counted multiple times for files whose link
>   count does not exceed 1.  From the user's point of view,
>   this means du's output is a reliable count of disk usage
>   only if du is invoked without -L and with -x and with at
>   most one operand.
> 
>     Change line 84170 [du DESCRIPTION] from:
> 
>       Files with multiple links shall be counted and written
>       for only one entry.
> 
>     to:
> 
>       A file that occurs multiple times under one file
>       operand and that has a link count greater than 1 shall
>       be counted and written for only one entry.  It is
>       implementation-defined whether a file that has a link
>       count no greater than 1 is counted and written just
>       once, or is counted and written for each occurrence.
>       It is implementation-defined whether a file that
>       occurs under one file operand is counted for other
>       file operands.
> 
> Option 2A is simpler and clearer, but it invalidates many
> existing implementations.  Option 2B modifies the standard
> to describe how existing implementations actually work, but
> is more complicated and more of a hassle to use reliably.
> 
> Eric raised one other issue: the description of the -a
> option implies that "du A B" must always list B.  This
> implication is incorrect for 7th edition Unix du, GNU du,
> and (I expect) FreeBSD du, so it should be fixed as well.
> Here's one possible fix, which is independent of the
> abovementioned changes.
> 
>   Change line ????? [du OPTIONS] from:
> 
>     Regardless of the presence of the -a option,
>     non-directories given as file operands shall always
>     be listed.
> 
>   to:
> 
>     The -a option does not affect whether
>     non-directories given as file operands are listed.
> 
> (Sorry, I don't know the line number here; I don't have a
> PDF copy of the current standard and don't know offhand how
> to get one.)

It boils down to a decision of whether we want to standardize a useful
behavior, and whether that behavior avoids over-counting, but possibly
invalidating existing implementations (in which case, it is better
targeted to Issue 8), or whether we give up and declare things
unspecified when encountering files with link count of 1 through
multiple locations (in which case we could make the changes in TC2 of
Issue 7, and still make recommendations on the underlying goal of
avoiding over-counting).

The call today also mentioned that cpio may have a similar issue on
overcounting.

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Thu, 12 Jan 2012 18:31:01 GMT) Full text and rfc822 format available.

Message #73 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: Don Cragun <dcragun <at> sonic.net>, 10281 <at> debbugs.gnu.org,
	austin-group-l <austin-group-l <at> opengroup.org>
Subject: Re: bug#10281: [1003.1(2008)/Issue 7 0000527]: du and files found
	via multiple command line arguments
Date: Thu, 12 Jan 2012 10:29:56 -0800
On 01/12/12 09:49, Eric Blake wrote:
> It boils down to a decision of whether we want to standardize a useful
> behavior, and whether that behavior avoids over-counting, but possibly
> invalidating existing implementations (in which case, it is better
> targeted to Issue 8), or whether we give up and declare things
> unspecified when encountering files with link count of 1 through
> multiple locations (in which case we could make the changes in TC2 of
> Issue 7, and still make recommendations on the underlying goal of
> avoiding over-counting).

We can do both, and it makes sense to do both.
That is, we can have Issue 7 TC2 specify Option 2B
with a suggestion to implement Option 2A, and have
Issue 8 require Option 2A.

> cpio may have a similar issue on overcounting.

Good point; it's likely that many implementations of pax, cpio, and
tar have problems in this area.  I fixed this bug in GNU tar
in 2010, here:

http://git.savannah.gnu.org/cgit/tar.git/commit/?id=37ddfb0b7eb41cc3f58bce686d389b1e965e9ccf




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 23 Jan 2012 17:15:01 GMT) Full text and rfc822 format available.

Message #76 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: austin-group-l <austin-group-l <at> opengroup.org>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: [1003.1(2008)/Issue 7 0000527]: du and files found
	via multiple command line arguments
Date: Mon, 23 Jan 2012 09:13:56 -0800
On 01/13/2012, Geoff Clare wrote:

> One problem with requiring Option 2A is that it requires du to use
> much more memory for hierarchies where there are large numbers
> of files with link count 1.  This could be a problem for embedded
> systems in particular.

The extra memory shouldn't be needed in the typical case
where there is at most one file operand and where -L is not used.
In the typical case, du is within its rights to not hash
files whose link count is 1, even if Option 2A is required.
This is because in the typical case du can't encounter the
same file twice if its link count is 1.




Forcibly Merged 10281 10282 11526. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 20 May 2012 21:19:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sat, 16 Feb 2013 10:06:02 GMT) Full text and rfc822 format available.

Message #81 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <idfah <at> cs.colostate.edu>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit	efe53cc)
Date: Sat, 16 Feb 2013 09:59:59 +0000 (UTC)
Does anyone know if anything ever happened with this?  Did we get clarification
on POSIX?





Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Sun, 17 Feb 2013 06:12:01 GMT) Full text and rfc822 format available.

Message #84 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Elliott Forney <idfah <at> cs.colostate.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Sun, 17 Feb 2013 00:09:57 -0600
On 02/16/2013 03:59 AM, Elliott Forney wrote:
> Does anyone know if anything ever happened with this?  Did we get clarification
> on POSIX?
My reading of <http://austingroupbugs.net/view.php?id=527#c1104>
is that POSIX allows but does not require the current GNU behavior,
and that future versions of POSIX may require the current GNU behavior.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 18 Feb 2013 08:20:01 GMT) Full text and rfc822 format available.

Message #87 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <idfah <at> cs.colostate.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 18 Feb 2013 01:18:23 -0700
> My reading of <http://austingroupbugs.net/view.php?id=527#c1104>
> is that POSIX allows but does not require the current GNU behavior,
> and that future versions of POSIX may require the current GNU behavior.

Thanks Paul, I agree with your reading of this.  Sounds like POSIX
allows both the new and old behaviors.

I must express, however, that I think this is a case where both the
standard and the current implementation were well-intentioned but not
well thought out.  Please allow me to state some reasons why I am
opposed to the current behavior followed by an example.  If I fail to
persuade people then I will let this issue be.

1.  I find it unintuitive that the number in a line of output from du
does not necessarily reflect the size of the corresponding directory
or file.  Without being privy to du's behavior regarding links and
multiple command-line arguments, this would be my expectation.

2.  Although I can see how it might add functionality to avoid
recounting files with a link count greater than one (although I don't
find it personally useful) I do not see any added benefit of not
recounting files with link count equal to one (e.g., across multiple
command-line arguments).  This is where I think the implications of
the POSIX standard were not well thought out.  I think that the
intention was to prevent counting files multiple times if there were
multiple links to the same file.  As an ill-considered side effect of
this, and particularly in the current implementation, we now find that
du will not recount across multiple command-line arguments.  I have an
example of how this is confusing below.

3.  I find it surprising that the order of command-line arguments to
du may affect the output of du.  Users don't expect this.

4.  This deviates significantly from other implementations and
historical behavior.  To my knowledge, GNU coreutils is the
odd-man-out with all other implementations following the previous
behavior.

5.  I couldn't agree with Bernhard Voelker more:
> That reminds me about a real-life question you could ask your little
> daughter: "how many pupils are in each class and in total at school?".
> I guess you would send her to extra math courses if she said "Class A
> has 20, class B and class C have 25 each, and the school has 0."
> This example doesn't claim to be 100% relevant for du, but shows
> how "counting" and "summarizing" are burnt into human brains.

Personally, I think that du should recount links and command-line
arguments everywhere except in the total, as reported by the -c flag.
This would add to the information reported by du without violating 1
above.

Let's consider an example.  I have actually had several people in my
office confused over variations of this same problem since du's
behavior has changed.  If several people have come looking for help,
this means that many more are confused and, worse yet, some
people/scripts probably haven't even caught the inconsistency.

Let's say that I want to answer the following question:  What are the
sizes of the directories "one" and "two" and "two/three" and all of
these directories combined.  Perhaps I know that "one" and "two" are
often too large and that "three" often causes "two" to grow too large.

Below is what my first principles would suggest I do.  Interestingly,
however, I notice that "two/three" is not reported at all.  So my
question is not answered.

$ du -ksc one two two/three
75096	one
4283824	two
4358920	total

Next, I might try reversing the order of arguments to see what
happens.  Now, I see that all are reported.  A hurried user might stop
here and go about their day.  A sly user will notice, however, that
"two/three" appears larger than "two".  How is this possible?!

$ du -ksc one two/three two
75096	one
3184072	two/three
1099752	two
4358920	total

So, I might wander down the hall and visit with a friend.  He suggests
that I use the --count-links flag to allow recounting (even though
there are not multiple links in this scenario).  Now, everything is
reported and the numbers on each line of output match, but what
happened to my total?  This can't be right, it's larger than my entire
disk quota!

$ du -ksc --count-links one two two/three
75096	one
4283824	two
3184072	two/three
7542992	total

Turns out that --count-links sums all the output, even those that are
recounted, yielding an incorrect total.  Finally, I break down and use
three commands to get the answer I was looking for.  Not pretty.

$ du -ks one two; du -ks two/three; du -ksc one two | tail -1
75096	one
4283824	two
3184072	two/three
4358920	total
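
The three-command workaround above generalizes to any operand list. A
minimal sketch follows, assuming a du that accepts -k, -s and -c; the
function name du_each is hypothetical. Each operand gets its own du
process, so nothing is elided from the per-operand lines, and the total
comes from one combined run. That total is deduplicated only when du
tracks files across operands, as the current GNU du does.

```shell
# Hypothetical helper: per-operand sizes first, then one grand total.
du_each() {
  for operand in "$@"; do
    du -ks -- "$operand"        # separate run: no state shared between operands
  done
  du -ksc -- "$@" | tail -n 1   # keep only the "total" line of a combined run
}
```

For example, "du_each one two two/three" prints one line per operand
followed by a single total line.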

Is it just me or does anyone else think this is convoluted?




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Tue, 19 Feb 2013 01:58:02 GMT) Full text and rfc822 format available.

Message #90 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Elliott Forney <idfah <at> cs.colostate.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Mon, 18 Feb 2013 17:56:50 -0800
On 02/18/2013 12:18 AM, Elliott Forney wrote:

> I find it unintuitive that the number in a line of output from
> du does not necessarily reflect the size of the corresponding
> directory or file.

That's been true forever.  All versions of 'du' do that.
For example, with Solaris 11 du:

$ ls
$ echo xxx >a/f
$ ln a/f b/f
$ du
16      ./a
8       ./b
32      .
$ du a
16      a
$ du b
16      b

This sort of thing has been the normal behavior since the
1970s and it's unlikely that we'll want to change it now.

> I find it surprising that the order of command-line arguments to
> du may affect the output of du.  Users don't expect this.

Also, users might not expect that the order that directory entries
are explored can affect the numbers that du outputs.  It's the same thing.
But we can't and shouldn't change the directory-entry behavior;
and this is an argument that we shouldn't change the behavior for
command-line arguments as well.

> Is it just me or does anyone else think this is convoluted?

Yes, it's convoluted, but complexity is inherent to any
file system that has hard links and where 'du' is supposed to
count actual and not apparent disk usage.  There's no way to
escape this complexity entirely: it has to be reported to the
user somehow.  No matter what method 'du' uses, one will be
able to construct confusing examples like the above.  So the
mere existence of examples that cause confusion is not sufficient
evidence that we should change du's behavior.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Tue, 19 Feb 2013 07:12:02 GMT) Full text and rfc822 format available.

Message #93 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <idfah <at> cs.colostate.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Tue, 19 Feb 2013 00:09:53 -0700
> That's been true forever.  All versions of 'du' do that.
> For example, with Solaris 11 du:
>
> $ ls
> $ echo xxx >a/f
> $ ln a/f b/f
> $ du
> 16      ./a
> 8       ./b
> 32      .
> $ du a
> 16      a
> $ du b
> 16      b

The example you give is different in one important way:  it involves
multiple hard links.  The example I gave involved only nested
directories.

With Solaris du, each line corresponds to the file/directory size
given that there are not multiple links:

$ mkdir one one/two
$ echo xxx > one/two/f
$ du -s one one/two
12      one
8       one/two
$ du -s one/two one
8       one/two
12      one

Whereas in GNU du we now have:

$ du -s one one/two
12	one
$ du -s one/two one
8	one/two
4	one

I guess part of the reason I see this so much is that I tend to use
the -s option to get the summary sizes of a list of directories and
sometimes they are nested.  I really think this is the most common
usage for users.

> This sort of thing has been the normal behavior since the
> 1970s and it's unlikely that we'll want to change it now.

The behavior that we've had since the 70's was changed in the recent
release of du.  That's the whole reason this bug report popped up.  I
will admit, however, that I sometimes think we should all be more open
to changes but, in this case, I think the change is more confusing.

> Also, users might not expect that the order that directory entries
> are explored can affect the numbers that du outputs.  It's the same thing.
> But we can't and shouldn't change the directory-entry behavior;
> and this is an argument that we shouldn't change the behavior for
> command-line arguments as well.

Yes, this is a justification for having the order of arguments matter,
but only if one concedes that multiple command-line arguments should
not be recounted.

> No matter what method 'du' uses, one will be
> able to construct confusing examples like the above.  So the
> mere existence of examples that cause confusion is not sufficient
> evidence that we should change du's behavior.

Of course convoluted examples can always be constructed, but the
example I gave should not be overly complex.  It is a typical use
case.

It really boils down to whether or not multiple command line arguments
should be recounted.  There is no question in my mind about recounting
multiple links (not recounting links is the historical behavior if for
no other reason).




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Tue, 19 Feb 2013 16:39:01 GMT) Full text and rfc822 format available.

Message #96 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Elliott Forney <idfah <at> cs.colostate.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Tue, 19 Feb 2013 08:37:34 -0800
On 02/18/2013 11:09 PM, Elliott Forney wrote:
> There is no question in my mind about recounting
> multiple links (not recounting links is the historical behavior if for
> no other reason).

Unfortunately if memory serves that's not the case either.
In non-GNU implementations, multiple links are sometimes counted
and sometimes not.  If you run 'du a b' and a/f and b/f
are hard links, they might be counted twice, and might not be -- it
depends on the implementation.
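
One way to see which behavior a given du has is to probe it in a scratch
directory. The following is a rough sketch, not a rigorous test: the
function name and messages are hypothetical, and it assumes mktemp and
/dev/urandom exist, that two fresh directories occupy the same space on
their own, and that the file system reports the new file's blocks
immediately.

```shell
# Hypothetical probe: does "du a b" count a file hard-linked from both a and b twice?
probe_du_hardlinks() {
  tmp=$(mktemp -d) || return 1
  mkdir "$tmp/a" "$tmp/b"
  # 64 KiB of random data, so the shared file has a clearly nonzero block count.
  dd if=/dev/urandom of="$tmp/a/f" bs=1024 count=64 2>/dev/null
  ln "$tmp/a/f" "$tmp/b/f"                  # second hard link to the same inode
  # One du run with both operands; keep only the two size columns.
  set -- $(cd "$tmp" && du -sk a b | awk '{print $1}')
  rm -rf "$tmp"
  if [ "$1" -eq "$2" ]; then
    echo "counted twice"                    # b's size includes the shared file
  else
    echo "counted once"                     # b's size omits the shared file
  fi
}
```

With the GNU behavior discussed in this report, the probe should print
"counted once".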

The area is indeed a mess and no matter what GNU du does it will
disagree with somebody.

It might be reasonable to add an option to GNU du to have it mimic
the behavior you prefer, if we can nail down exactly what that is.




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Tue, 19 Feb 2013 23:28:01 GMT) Full text and rfc822 format available.

Message #99 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Elliott Forney <idfah <at> cs.colostate.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit efe53cc)
Date: Tue, 19 Feb 2013 16:26:17 -0700
> Unfortunately if memory serves that's not the case either.
> In non-GNU implementations, multiple links are sometimes counted
> and sometimes not.  If you run 'du a b' and a/f and b/f
> are hard links, they might be counted twice, and might not be -- it
> depends on the implementation.

I wasn't aware of that; it sounds like there have been inconsistencies
for a long time.

> The area is indeed a mess and no matter what GNU du does it will
> disagree with somebody.

Sure.  As you stated before, filesystems are inherently complex,
especially when links are allowed.  I agree with that.  I apologize if
I am being difficult.  I just want to be sure this discussion is had
because it seems like this is a significant change in behavior.  It
does make sense, it just doesn't feel intuitive to me.

> It might be reasonable to add an option to GNU du to have it mimic
> the behavior you prefer, if we can nail down exactly what that is.

Well, what really bothers me is the second command below.

$ du -ks tmp tmp/bash
1033864	tmp

$ du -ks tmp/bash tmp
182684	tmp/bash
851180	tmp

The size of tmp is underrepresented even though there are no links.
To be honest, however, I can't think of a good way around this.  It
would be nice to simply omit tmp/bash since it is in a directory under
another argument but I'm unsure how that would be achieved without
making things even more convoluted and less efficient.  I am at a
loss.

Once a user is aware of what is going on, the problem can be avoided
either by using multiple du commands or by using --count-links and
understanding that any totals and links may be overcounted.  If you
are confident this is the best way to go forward then I can live with
it.
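
The first workaround mentioned above (one du command per argument) can
be sketched as follows.  The directory layout and file sizes are made
up for illustration, and the sketch assumes a GNU du that includes the
change discussed in this thread.

```shell
# Sketch: a nested argument listed first hides its size from the parent.
top=$(mktemp -d)                       # scratch location (assumption)
mkdir -p "$top/tmp/bash"
head -c 262144 /dev/urandom > "$top/tmp/bash/big"
head -c 262144 /dev/urandom > "$top/tmp/other"

# One invocation: tmp/bash is visited first, so tmp is reported without it.
single=$(cd "$top" && du -ks tmp/bash tmp)

# One du process per argument: nothing is remembered between runs,
# so tmp is reported in full.
separate=$(cd "$top" && for d in tmp/bash tmp; do du -ks "$d"; done)

printf 'single run:\n%s\n\nseparate runs:\n%s\n' "$single" "$separate"
```

The separate runs give a full figure for each argument, at the cost
that their sum counts tmp/bash twice.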




Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Wed, 20 Feb 2013 10:17:01 GMT) Full text and rfc822 format available.

Message #102 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Linda Walsh <coreutils <at> tlinx.org>
To: Elliott Forney <idfah <at> cs.colostate.edu>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
	(commit	efe53cc)
Date: Wed, 20 Feb 2013 02:15:25 -0800
Elliott Forney wrote:
> $ du -ks tmp tmp/bash
> 1033864	tmp
>
> $ du -ks tmp/bash tmp
> 182684	tmp/bash
> 851180	tmp
>
> The size of tmp is underrepresented even though there are no links.
>   
----
   Is it?  I mean, is it showing less space than the sum of the
files in the tmp dir, exclusive of the directory you singled out
for a separate tally?

I.e., by singling out tmp/bash, it could easily be taken that you
want it to be tallied separately from tmp and not have its space
included -- vs. the way you seem to want it -- which would be to
provide a total.

In the specific example you show, du -cks shows what you want,
but easily might not depending on your args.

The only way to do what you want would be to have a running tree
mode that displays two columns for each dir you specified: one
column with the total of everything under it, and a 2nd column
that shows that total with any dirs specified on the command line
removed from the tally:

Ex: suppose we have dirs with these sizes of files in the dirs
a:100
 a/b:100, a/c:100, a/d:100,
   a/b/e:100
a2:100
---
If I specified:

du [some-trigger-arg] a/b/e a/b a a2
100   100   a/b/e
200   100   a/b
500   300   a         #**          
100   100   a2
 ** 2nd number includes dirs 'c' and 'd' as they were not mentioned separately.
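
A rough approximation of this two-column output is possible with the
existing du, without a new option.  In the sketch below, du_columns is
a hypothetical helper name, not a du feature.  It assumes GNU du's
cross-argument tracking, arguments listed deepest first (as in the
example above), and paths without tabs or newlines.

```shell
# du_columns: hypothetical helper, prints "total  exclusive  path".
du_columns() {
    # Second column: a single du run over all arguments, so a directory
    # already counted under an earlier argument is not counted again.
    du -ks -- "$@" | while IFS="$(printf '\t')" read -r exclusive path; do
        # First column: a fresh du run for this path alone, which
        # includes everything underneath it.
        total=$(du -ks -- "$path" | cut -f1)
        printf '%s\t%s\t%s\n' "$total" "$exclusive" "$path"
    done
}

# Example call, deepest directories first:
#   du_columns a/b/e a/b a a2
```

Real numbers will not be the round figures of the example, since du
also counts the blocks used by the directories themselves, and hard
links between the listed directories would still shift sizes toward
whichever argument is visited first.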

I'm not sure what the cost/benefit ratio is on this, at this
time... might rise.

Though you'd also have to be specific -- what would happen
if the files in a/b/e were hard links to 'a'... output
could still be different than what you wanted.  Maybe a perl
script to munge the output? then an alias and/or function to
call your extension when you wanted?

alias would be easy, but if you wanted to get fancy,
you could make a:

function du_special {
    : # your custom tallying goes here
}

function du {
    if [ "$1" = "myflag" ]; then    # "myflag" is a placeholder trigger
        shift
        du_special "$@"
    else
        command du "$@"             ## run real du...
    fi
}

I asked for the -h on sort about 5-6 years ago, but
no one wanted it then... now it's just there.  Unfortunately
I find that I'm often some number of years ahead of where
critical mass to have something happen is...;-/

Just some random suggestions...








Information forwarded to bug-coreutils <at> gnu.org:
bug#10281; Package coreutils. (Mon, 15 Oct 2018 15:04:01 GMT) Full text and rfc822 format available.

Message #105 received at 10281 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 10281 <at> debbugs.gnu.org
Subject: Re: bug#10281: change in behavior of du with multiple arguments
 (commit efe53cc)
Date: Mon, 15 Oct 2018 09:03:16 -0600
severity 10281 wishlist
tags 10281 wontfix
retitle 10281 du: hard-links counting with multiple arguments  (commit 
efe53cc)
close 10281

(triaging old bugs)

[...]
On 18/02/13 01:18 AM, Elliott Forney wrote:
>> My reading of <http://austingroupbugs.net/view.php?id=527#c1104>
>> is that POSIX allows but does not require the current GNU behavior,
>> and that future versions of POSIX may require the current GNU behavior.
>
> Thanks Paul, I agree with your reading of this.  Sounds like POSIX
> allows both the new and old behaviors.

Long and interesting thread about du, hard-links, and POSIX interpretations,
but with no follow-up in 5 years and no change in du's behavior,
I'm closing this bug.

regards,
 - assaf





Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 15:04:02 GMT) Full text and rfc822 format available.

Added tag(s) wontfix. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 15:04:02 GMT) Full text and rfc822 format available.

Changed bug title to 'du: hard-links counting with multiple arguments (commit' from 'change in behavior of du with multiple arguments (commit efe53cc)' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 15:04:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 10281 <at> debbugs.gnu.org and Paul Eggert <eggert <at> cs.ucla.edu> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 15:04:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 13 Nov 2018 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 137 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.