Package: coreutils;
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Sun, 27 Jun 2010 22:10:02 UTC
Severity: normal
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6524 in the body.
You can then email your comments to 6524 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 27 Jun 2010 22:10:02 GMT) Full text and rfc822 format available.Jim Meyering <jim <at> meyering.net>
:bug-coreutils <at> gnu.org
.
(Sun, 27 Jun 2010 22:10:03 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: bug-coreutils <at> gnu.org Subject: du now uses less than half as much memory, sometimes Date: Mon, 28 Jun 2010 00:09:04 +0200
FYI, This is just a preliminary heads-up. At least one detail must be addressed before I push: I haven't written justification for the change to hash.c (I confess that I don't remember -- parts of this change date back to Dec 2008 ;-) Feedback welcome, as usual. --------------------------- Quick summary, here's the NEWS addition: du now uses less than half as much memory when operating on trees with many hard-linked files. With --count-links (-l), or when operating on trees with no hard-linked files, there is no change. This started about two years ago, when someone reported that du would use what seemed like an inordinate amount of RAM when operating on many hard-linked files. As you can see (read the comments and logs below), du must use *some* space for each and every file with a link count of 2 or greater. However, there was plenty of room for improvement. From the last two change sets, (included mostly for reference -- I doubt I'll push either), you can see that the improved du now consumes far less memory than the old one. Somewhat less than half as much. Note the vertical scale change and that the total instruction count (x axis) is nearly identical; the new one is a tiny bit faster. [Notice the peaks; they occur each time hash.c rehashes into a larger table. Also notice that with the old version, memory usage grows between rehash calls, while in the new one, it is flat. ] +On an x86_64 system: + +-------------------------------------------------------------------------------- +Command: ./src/du.old -a /t/z +Massif arguments: --massif-out-file=m.old +ms_print arguments: m.old +-------------------------------------------------------------------------------- + + + MB +5.618^ # + | # + | # + | # + | # + | # ::: + | @ #:::::::: + | @ #:::::::: + | @ ::#:::::::: + | @ :@:::::@::#:::::::@ + | @ @:::@::::@:::::@::#:::::::@ + | @ @:::@::::@:::::@::#:::::::@ + | @ ::::@::::@:::@::::@:::::@::#:::::::@ + | @@ @:::: ::@::::@:::@::::@:::::@::#:::::::@ + | @ :::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @ @ :::::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @::::::@ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | ::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @ :@:::::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | ::@@:::@:: ::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + 0 +----------------------------------------------------------------------->Mi + 0 189.6 + +Number of snapshots: 69 + Detailed snapshots: [4, 5, 9, 14, 19, 26, 32, 37, 42, 47, 54, 57 (peak), 67] + +-------------------------------------------------------------------------------- + n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) +-------------------------------------------------------------------------------- + 0 0 0 0 0 0 + 1 2,515,906 185,176 170,412 14,764 0 + 2 5,834,589 258,000 231,344 26,656 0 +-------------------------------------------------------------------------------- +Command: ./src/du.new -a /t/z +Massif arguments: --massif-out-file=m.new +ms_print arguments: m.new +-------------------------------------------------------------------------------- + + + MB +3.145^ # + | # + | # + | # + | # + | # + | @@ # + | @ # + | @ # :: + | @ #:::::@::::@:::::@ + | @ @ #:::::@::::@:::::@ + | @ @ #:::::@::::@:::::@ + | @ @ ::::::::@:::::#:::::@::::@:::::@ + | @@ @ @ ::::: ::@:::::#:::::@::::@:::::@ + | @ @::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ @::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | :::@@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ :::@:: @@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | :@::@@@:::@:: @@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + 0 +----------------------------------------------------------------------->Mi + 0 188.8 + +Number of snapshots: 93 + Detailed snapshots: [4, 7, 8, 9, 14, 18, 19, 21, 29, 40, 51, 60 (peak), 70, 80, 90] + +-------------------------------------------------------------------------------- + n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) +-------------------------------------------------------------------------------- + 0 0 0 0 0 0 + 1 3,286,259 134,120 129,254 4,866 0 + 2 5,525,860 226,432 218,599 7,833 0 -- 1.7.1.764.g84d391 From bc84b17d6436e9efb8d8946bd4b6e28ef40621b8 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Fri, 11 Jun 2010 19:43:49 +0200 Subject: [PATCH 1/7] include patched hash module * gl/lib/hash.c.diff: New file. * gl/lib/hash.h.diff: New file. --- gl/lib/hash.c.diff | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++ gl/lib/hash.h.diff | 12 ++++++ 2 files changed, 108 insertions(+), 0 deletions(-) create mode 100644 gl/lib/hash.c.diff create mode 100644 gl/lib/hash.h.diff diff --git a/gl/lib/hash.c.diff b/gl/lib/hash.c.diff new file mode 100644 index 0000000..72b1cb4 --- /dev/null +++ b/gl/lib/hash.c.diff @@ -0,0 +1,96 @@ +diff --git a/lib/hash.c b/lib/hash.c +index f9abb9f..97dbd40 100644 +--- a/lib/hash.c ++++ b/lib/hash.c +@@ -1020,14 +1020,13 @@ hash_rehash (Hash_table *table, size_t candidate) + return false; + } + +-/* If ENTRY matches an entry already in the hash table, return the pointer +- to the entry from the table. Otherwise, insert ENTRY and return ENTRY. +- Return NULL if the storage required for insertion cannot be allocated. +- This implementation does not support duplicate entries or insertion of +- NULL. */ +- +-void * +-hash_insert (Hash_table *table, const void *entry) ++/* Return -1 upon memory allocation failure. ++ Return 1 if insertion succeeded. ++ Return 0 if there is already a matching entry in the table, ++ and in that case, if MATCHED_ENT is non-NULL, set *MATCHED_ENT ++ to that entry. */ ++int ++hash_insert0 (Hash_table *table, void const *entry, void const **matched_ent) + { + void *data; + struct hash_entry *bucket; +@@ -1038,7 +1037,11 @@ hash_insert (Hash_table *table, const void *entry) + + /* If there's a matching entry already in the table, return that. */ + if ((data = hash_find_entry (table, entry, &bucket, false)) != NULL) +- return data; ++ { ++ if (matched_ent) ++ *matched_ent = data; ++ return 0; ++ } + + /* If the growth threshold of the buckets in use has been reached, increase + the table size and rehash. There's no point in checking the number of +@@ -1062,11 +1065,11 @@ hash_insert (Hash_table *table, const void *entry) + * tuning->growth_threshold)); + + if (SIZE_MAX <= candidate) +- return NULL; ++ return -1; + + /* If the rehash fails, arrange to return NULL. */ + if (!hash_rehash (table, candidate)) +- return NULL; ++ return -1; + + /* Update the bucket we are interested in. */ + if (hash_find_entry (table, entry, &bucket, false) != NULL) +@@ -1081,7 +1084,7 @@ hash_insert (Hash_table *table, const void *entry) + struct hash_entry *new_entry = allocate_entry (table); + + if (new_entry == NULL) +- return NULL; ++ return -1; + + /* Add ENTRY in the overflow of the bucket. */ + +@@ -1089,7 +1092,7 @@ hash_insert (Hash_table *table, const void *entry) + new_entry->next = bucket->next; + bucket->next = new_entry; + table->n_entries++; +- return (void *) entry; ++ return 1; + } + + /* Add ENTRY right in the bucket head. */ +@@ -1098,7 +1101,23 @@ hash_insert (Hash_table *table, const void *entry) + table->n_entries++; + table->n_buckets_used++; + +- return (void *) entry; ++ return 1; ++} ++ ++/* If ENTRY matches an entry already in the hash table, return the pointer ++ to the entry from the table. Otherwise, insert ENTRY and return ENTRY. ++ Return NULL if the storage required for insertion cannot be allocated. ++ This implementation does not support duplicate entries or insertion of ++ NULL. */ ++ ++void * ++hash_insert (Hash_table *table, void const *entry) ++{ ++ void const *matched_ent; ++ int err = hash_insert0 (table, entry, &matched_ent); ++ return (err == -1 ++ ? NULL ++ : (void *) (err == 0 ? matched_ent : entry)); + } + + /* If ENTRY is already in the table, remove it and return the just-deleted diff --git a/gl/lib/hash.h.diff b/gl/lib/hash.h.diff new file mode 100644 index 0000000..8219911 --- /dev/null +++ b/gl/lib/hash.h.diff @@ -0,0 +1,12 @@ +diff --git a/lib/hash.h b/lib/hash.h +index e795cdc..0ae3fe5 100644 +--- a/lib/hash.h ++++ b/lib/hash.h +@@ -88,6 +88,7 @@ void hash_free (Hash_table *); + /* Insertion and deletion. */ + bool hash_rehash (Hash_table *, size_t) ATTRIBUTE_WUR; + void *hash_insert (Hash_table *, const void *) ATTRIBUTE_WUR; ++int hash_insert0 (Hash_table *table, const void *entry, const void **matched_ent); + void *hash_delete (Hash_table *, const void *); + + #endif -- 1.7.1.764.g84d391 From 563f6db676ff8c6e72345f0e446a707c534cdb58 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Sun, 27 Jun 2010 23:26:46 +0200 Subject: [PATCH 2/7] dev-map: map device number to small non-negative * gl/lib/dev-map.c: New file. * gl/lib/dev-map.h: Declarations. * gl/modules/dev-map: Define primary modules. * gl/modules/dev-map-tests: Define test module. * gl/tests/test-dev-map.c: Test it. --- gl/lib/dev-map.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++ gl/lib/dev-map.h | 23 ++++++++++ gl/modules/dev-map | 23 ++++++++++ gl/modules/dev-map-tests | 10 ++++ gl/tests/test-dev-map.c | 63 ++++++++++++++++++++++++++ 5 files changed, 229 insertions(+), 0 deletions(-) create mode 100644 gl/lib/dev-map.c create mode 100644 gl/lib/dev-map.h create mode 100644 gl/modules/dev-map create mode 100644 gl/modules/dev-map-tests create mode 100644 gl/tests/test-dev-map.c diff --git a/gl/lib/dev-map.c b/gl/lib/dev-map.c new file mode 100644 index 0000000..363f5af --- /dev/null +++ b/gl/lib/dev-map.c @@ -0,0 +1,110 @@ +/* Map a dev_t device number to a small integer. + + Copyright 2009-2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Jim Meyering */ + +#include <config.h> +#include "dev-map.h" + +#include <stdio.h> +#include <assert.h> +#include <stdint.h> +#include <stdlib.h> + +struct dev_map_ent +{ + dev_t dev; + uint32_t mapped_dev; +}; + +enum { INITIAL_DEV_MAP_TABLE_SIZE = 31 }; + +static size_t +dev_hash (void const *x, size_t table_size) +{ + struct dev_map_ent const *p = x; + return (uintmax_t) p->dev % table_size; +} + +/* Compare two dev_map_ent structs on "dev". + Return true if they are the same. */ +static bool +dev_compare (void const *x, void const *y) +{ + struct dev_map_ent const *a = x; + struct dev_map_ent const *b = y; + return a->dev == b->dev ? true : false; +} + +/* Initialize state. */ +int +dev_map_init (struct dev_map *dev_map) +{ + assert (dev_map != NULL); + dev_map->n_device = 0; + dev_map->dev_map = hash_initialize (INITIAL_DEV_MAP_TABLE_SIZE, NULL, + dev_hash, dev_compare, free); + return dev_map->dev_map ? 0 : -1; +} + +void +dev_map_free (struct dev_map *dev_map) +{ + hash_free (dev_map->dev_map); +} + +/* Insert device number, DEV, into the map, DEV_MAP if it's not there already, + and return the nonnegative, mapped and usually smaller, number. + If DEV is already in DEV_MAP, return the existing mapped value. + Upon memory allocation failure, return -1. */ +int +dev_map_insert (struct dev_map *dev_map, dev_t dev) +{ + /* Attempt to insert <DEV,n_device> in the map. + Possible outcomes: + - DEV was already there, with a different necessarily smaller value + - DEV was not there, (thus was just inserted) + - insert failed: out of memory + Return -1 on out of memory. + */ + + struct dev_map_ent *ent_from_table; + struct dev_map_ent *ent = malloc (sizeof *ent); + if (!ent) + return -1; + + ent->dev = dev; + ent->mapped_dev = dev_map->n_device; + + ent_from_table = hash_insert (dev_map->dev_map, ent); + if (ent_from_table == NULL) + { + /* Insertion failed due to lack of memory. */ + return -1; + } + + if (ent_from_table == ent) + { + /* Insertion succeeded: new device. */ + return dev_map->n_device++; + } + + /* That DEV value is already in the table, so ENT was not inserted. + Free it and return the already-associated value. */ + free (ent); + return ent_from_table->mapped_dev; +} diff --git a/gl/lib/dev-map.h b/gl/lib/dev-map.h new file mode 100644 index 0000000..f093d90 --- /dev/null +++ b/gl/lib/dev-map.h @@ -0,0 +1,23 @@ +#include <stddef.h> +#include <sys/types.h> +#include <sys/stat.h> +#include "hash.h" + +struct dev_map +{ + /* KEY,VAL pair, where KEY is the raw st_dev value + and VAL is the small number that maps to. */ + struct hash_table *dev_map; + size_t n_device; +}; + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m) +#endif + +int dev_map_init (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); +void dev_map_free (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); +int dev_map_insert (struct dev_map *, dev_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/modules/dev-map b/gl/modules/dev-map new file mode 100644 index 0000000..91f437b --- /dev/null +++ b/gl/modules/dev-map @@ -0,0 +1,23 @@ +Description: +maintain a mapping of dev_t numbers to small integers + +Files: +lib/dev-map.c +lib/dev-map.h + +Depends-on: +hash + +configure.ac: + +Makefile.am: +lib_SOURCES += dev-map.c dev-map.h + +Include: +"dev-map.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/dev-map-tests b/gl/modules/dev-map-tests new file mode 100644 index 0000000..4bec2e6 --- /dev/null +++ b/gl/modules/dev-map-tests @@ -0,0 +1,10 @@ +Files: +tests/test-dev-map.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-dev-map +check_PROGRAMS += test-dev-map diff --git a/gl/tests/test-dev-map.c b/gl/tests/test-dev-map.c new file mode 100644 index 0000000..98ba630 --- /dev/null +++ b/gl/tests/test-dev-map.c @@ -0,0 +1,63 @@ +/* Test the dev-map module. + Copyright (C) 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* Written by Jim Meyering. */ + +#include <config.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +/* FIXME: once/if in gnulib, use #include "macros.h" in place of this */ +#define ASSERT(expr) \ + do \ + { \ + if (!(expr)) \ + { \ + fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ + fflush (stderr); \ + abort (); \ + } \ + } \ + while (0) + +#include "dev-map.h" + +/* Risky: this is also defined in di-set.c; they should match. */ +#define N_DEV_BITS_4 5 + +int +main () +{ + /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ + struct dev_map dev_map; + ASSERT (dev_map_init (&dev_map) == 0); + + ASSERT (dev_map_insert (&dev_map, 42) == 0); + ASSERT (dev_map_insert (&dev_map, 42) == 0); + ASSERT (dev_map_insert (&dev_map, 398) == 1); + ASSERT (dev_map_insert (&dev_map, 398) == 1); + + int32_t i; + for (i = 0; i < (1 << N_DEV_BITS_4); i++) + { + ASSERT (dev_map_insert (&dev_map, 10000+i) == 2+i); + } + + dev_map_free (&dev_map); + + return 0; +} -- 1.7.1.764.g84d391 From 4fe5a595909b4f8d7b0f96d4d0a18d1e0f8906d2 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Sun, 27 Jun 2010 23:29:07 +0200 Subject: [PATCH 3/7] di-set: manipulate sets of dev/inode pairs efficiently * gl/lib/di-set.c: Implementation. * gl/lib/di-set.h: Declarations. * gl/modules/di-set: Define module. * gl/modules/di-set-tests: Define test module. * gl/tests/test-di-set.c: Likewise. --- gl/lib/di-set.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++ gl/lib/di-set.h | 28 +++++ gl/modules/di-set | 25 +++++ gl/modules/di-set-tests | 10 ++ gl/tests/test-di-set.c | 85 +++++++++++++++ 5 files changed, 424 insertions(+), 0 deletions(-) create mode 100644 gl/lib/di-set.c create mode 100644 gl/lib/di-set.h create mode 100644 gl/modules/di-set create mode 100644 gl/modules/di-set-tests create mode 100644 gl/tests/test-di-set.c diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c new file mode 100644 index 0000000..3c4717b --- /dev/null +++ b/gl/lib/di-set.c @@ -0,0 +1,276 @@ +/* Set operations for device-inode pairs stored in a space-efficient manner. + + Copyright 2009-2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Jim Meyering */ + +#include <config.h> +#include "di-set.h" + +#include <stdio.h> +#include <assert.h> +#include <stdint.h> +#include <stdlib.h> +#include <sys/types.h> +#include <sys/stat.h> + +#include "verify.h" + +/* Set operations for device-inode pairs stored in a space-efficient manner. + A naive mapping uses 16 bytes to save a single st_dev, st_ino pair. + However, in many applications, the vast majority of actual device,inode + number pairs can be efficiently compressed to fit in 8 or even 4 bytes, + by using a separate table to map a relatively small number of devices + to small integers. */ + +#define N_DEV_BITS_4 5 +#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) + +#define N_DEV_BITS_8 8 +#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) + +/* Note how the last bit is always set. + This is required, in order to be able to distinguish + an encoded di_ent value from a malloc-returned pointer, + which must be 4-byte-aligned or better. */ +struct dev_ino_4 +{ + uint32_t mode:2; /* must be first */ + uint32_t short_ino:N_INO_BITS_4; + uint32_t mapped_dev:N_DEV_BITS_4; + uint32_t always_set:1; +}; +verify (N_DEV_BITS_4 <= 8 * sizeof (int)); +verify (sizeof (struct dev_ino_4) == 4); + +struct dev_ino_8 +{ + uint32_t mode:2; /* must be first */ + uint64_t short_ino:N_INO_BITS_8; + uint32_t mapped_dev:N_DEV_BITS_8; + uint32_t always_set:1; +}; +verify (sizeof (struct dev_ino_8) == 8); + +struct dev_ino_full +{ + uint32_t mode:2; /* must be first */ + dev_t dev; + ino_t ino; +}; + +enum di_mode +{ + DI_MODE_4 = 1, + DI_MODE_8 = 2, + DI_MODE_FULL = 3 +}; + +/* + di_mode raw_inode mapped dev always_set + \____________|_______________\_____/ + 4-byte | 2| 25 | 5 |1| mapped_dev + `----------------------------------------------------|-----. + 8-byte | 2| 53 | 8 |1| + `----------------------------------------------------------' +*/ +struct di_ent +{ + union + { + struct dev_ino_4 di4; + struct dev_ino_8 di8; + struct dev_ino_full full; + uint32_t u32; + uint64_t u64; + void *ptr; + } u; +}; + +struct dev_map_ent +{ + dev_t dev; + uint32_t mapped_dev; +}; + +static inline bool +is_encoded_ptr (struct di_ent const *v) +{ + return (size_t) v % 4; +} + +static struct di_ent +decode_ptr (struct di_ent const *v) +{ + if (!is_encoded_ptr (v)) + return *v; + + struct di_ent di; + di.u.ptr = (void *) v; + return di; +} + +static size_t +di_ent_hash (void const *x, size_t table_size) +{ + struct di_ent e = decode_ptr (x); + return (e.u.di4.mode == DI_MODE_4 + ? e.u.u32 + : (e.u.di4.mode == DI_MODE_8 + ? e.u.u64 + : e.u.full.ino)) % table_size; +} + +/* Compare two di_ent structs. + Return true if they are the same. */ +static bool +di_ent_compare (void const *x, void const *y) +{ + struct di_ent a = decode_ptr (x); + struct di_ent b = decode_ptr (y); + if (a.u.di4.mode != b.u.di4.mode) + return false; + + if (a.u.di4.mode == DI_MODE_4) + return (a.u.di4.short_ino == b.u.di4.short_ino + && a.u.di4.mapped_dev == b.u.di4.mapped_dev); + + if (a.u.di8.mode == DI_MODE_8) + return (a.u.di8.short_ino == b.u.di8.short_ino + && a.u.di8.mapped_dev == b.u.di8.mapped_dev); + + return (a.u.full.ino == b.u.full.ino + && a.u.full.dev == b.u.full.dev); +} + +static void +di_ent_free (void *v) +{ + if ( ! is_encoded_ptr (v)) + free (v); +} + +int +di_set_init (struct di_set_state *dis, size_t initial_size) +{ + if (dev_map_init (&dis->dev_map) < 0) + return -1; + + dis->di_set = hash_initialize (initial_size, NULL, + di_ent_hash, di_ent_compare, di_ent_free); + return dis->di_set ? 0 : -1; +} + +void +di_set_free (struct di_set_state *dis) +{ + dev_map_free (&dis->dev_map); + hash_free (dis->di_set); +} + +/* Given a device-inode set, DIS, create an entry for the DEV,INO + pair, and store it in *V. If possible, encode DEV,INO into the pointer + itself, but if not, allocate space for a full "struct di_ent" and set *V + to that pointer. Upon memory allocation failure, return -1. + Otherwise return 0. */ +int +di_ent_create (struct di_set_state *dis, + dev_t dev, ino_t ino, + struct di_ent **v) +{ + static int prev_m = -1; + static dev_t prev_dev = -1; + struct di_ent di_ent; + int mapped_dev; + + if (dev == prev_dev) + mapped_dev = prev_m; + else + { + mapped_dev = dev_map_insert (&dis->dev_map, dev); + if (mapped_dev < 0) + return -1; + prev_dev = dev; + prev_m = mapped_dev; + } + + if (mapped_dev < (1 << N_DEV_BITS_4) + && ino < (1 << N_INO_BITS_4)) + { +#if lint + /* When this struct is smaller than a pointer, initialize + the pointer so tools like valgrind don't complain about + the uninitialized bytes. */ + if (sizeof di_ent.u.di4 < sizeof di_ent.u.ptr) + di_ent.u.ptr = NULL; +#endif + di_ent.u.di4.mode = DI_MODE_4; + di_ent.u.di4.short_ino = ino; + di_ent.u.di4.mapped_dev = mapped_dev; + di_ent.u.di4.always_set = 1; + *v = di_ent.u.ptr; + } + else if (mapped_dev < (1 << N_DEV_BITS_8) + && ino < ((uint64_t) 1 << N_INO_BITS_8)) + { + di_ent.u.di8.mode = DI_MODE_8; + di_ent.u.di8.short_ino = ino; + di_ent.u.di8.mapped_dev = mapped_dev; + di_ent.u.di8.always_set = 1; + *v = di_ent.u.ptr; + } + else + { + /* Handle the case in which INO is too large or in which (far less + likely) we encounter hard-linked files on 2^N_DEV_BITS_8 + different devices. */ + struct di_ent *p = malloc (sizeof *p); + if (!p) + return -1; + assert ((size_t) p % 4 == 0); + p->u.full.mode = DI_MODE_FULL; + p->u.full.ino = ino; + p->u.full.dev = dev; + *v = p; + } + + return 0; +} + +/* Attempt to insert the DEV,INO pair into the set, DIS. + If it matches a pair already in DIS, don't modify DIS and return 0. + Otherwise, if insertion is successful, return 1. + Upon any failure return -1. */ +int +di_set_insert (struct di_set_state *dis, dev_t dev, ino_t ino) +{ + struct di_ent *v; + if (di_ent_create (dis, dev, ino, &v) < 0) + return -1; + + int err = hash_insert0 (dis->di_set, v, NULL); + if (err == -1) /* Insertion failed due to lack of memory. */ + return -1; + + if (err == 1) /* Insertion succeeded. */ + return 1; + + /* That pair is already in the table, so ENT was not inserted. Free it. */ + if (! is_encoded_ptr (v)) + free (v); + + return 0; +} diff --git a/gl/lib/di-set.h b/gl/lib/di-set.h new file mode 100644 index 0000000..f90d0dd --- /dev/null +++ b/gl/lib/di-set.h @@ -0,0 +1,28 @@ +#include "dev-map.h" + +struct di_set_state +{ + /* A map to help compact device numbers. */ + struct dev_map dev_map; + + /* A set of compact dev,inode pairs. */ + struct hash_table *di_set; +}; + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m,...) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m,...) +#endif + +int di_set_init (struct di_set_state *, size_t) _ATTRIBUTE_NONNULL_ (1); +void di_set_free (struct di_set_state *) _ATTRIBUTE_NONNULL_ (1); +int di_set_insert (struct di_set_state *, dev_t, ino_t) + _ATTRIBUTE_NONNULL_ (1); + +struct di_ent; +int di_ent_create (struct di_set_state *di_set_state, + dev_t dev, ino_t ino, + struct di_ent **di_ent) + _ATTRIBUTE_NONNULL_ (1,4); diff --git a/gl/modules/di-set b/gl/modules/di-set new file mode 100644 index 0000000..fe52778 --- /dev/null +++ b/gl/modules/di-set @@ -0,0 +1,25 @@ +Description: +manipulate sets of device-inode pairs efficiently + +Files: +lib/di-set.c +lib/di-set.h + +Depends-on: +dev-map +hash +verify + +configure.ac: + +Makefile.am: +lib_SOURCES += di-set.c di-set.h + +Include: +"di-set.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/di-set-tests b/gl/modules/di-set-tests new file mode 100644 index 0000000..d60f7fd --- /dev/null +++ b/gl/modules/di-set-tests @@ -0,0 +1,10 @@ +Files: +tests/test-di-set.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-di-set +check_PROGRAMS += test-di-set diff --git a/gl/tests/test-di-set.c b/gl/tests/test-di-set.c new file mode 100644 index 0000000..4d5823e --- /dev/null +++ b/gl/tests/test-di-set.c @@ -0,0 +1,85 @@ +/* Test the dev-map module. + Copyright (C) 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* Written by Jim Meyering. */ + +#include <config.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> +#include <stdint.h> + +#define ASSERT(expr) \ + do \ + { \ + if (!(expr)) \ + { \ + fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ + fflush (stderr); \ + abort (); \ + } \ + } \ + while (0) + +#include "di-set.h" + +/* FIXME: ugly duplication of code from di-set.c */ +#define N_DEV_BITS_4 5 +#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) + +#define N_DEV_BITS_8 8 +#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) + +int +main (void) +{ + /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ + size_t initial_size = 61; + /* "real" code might prefer to avoid the allocation here, simply + declaring "struct di_set_state dis;", do a global substitution, + s/\<dis\>/\&dis/, and remove the final free. */ + struct di_set_state *dis = malloc (sizeof *dis); + ASSERT (dis); + ASSERT (di_set_init (dis, initial_size) == 0); + + struct di_ent *di_ent; + ASSERT (di_ent_create (dis, 1, 1, &di_ent) == 0); + ASSERT (di_ent_create (dis, 1 << N_DEV_BITS_4, 1, &di_ent) == 0); + ASSERT (di_ent_create (dis, 1, 1 << N_INO_BITS_4, &di_ent) == 0); + ASSERT (di_ent_create (dis, 1, + (uint64_t) 1 << N_INO_BITS_8, &di_ent) == 0); + free (di_ent); + + ASSERT (di_set_insert (dis, 2, 5) == 1); /* first insertion succeeds */ + ASSERT (di_set_insert (dis, 2, 5) == 0); /* duplicate fails */ + ASSERT (di_set_insert (dis, 3, 5) == 1); /* diff dev, duplicate inode is ok */ + ASSERT (di_set_insert (dis, 2, 8) == 1); /* same dev, different inode is ok */ + + /* very large inode number */ + ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 1); + ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 0); /* dup */ + + unsigned int i; + for (i = 0; i < 3000; i++) + ASSERT (di_set_insert (dis, 9, i) == 1); + for (i = 0; i < 3000; i++) + ASSERT (di_set_insert (dis, 9, i) == 0); /* duplicate fails */ + + di_set_free (dis); + free (dis); + + return 0; +} -- 1.7.1.764.g84d391 From ffe335589aea4429c0f9b3bb49e4fbe7bce509db Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Tue, 9 Dec 2008 08:49:41 +0100 Subject: [PATCH 4/7] du: use less than half as much memory when tracking hard links When processing a hard-linked file, du must keep track of the file's device and inode numbers in order to avoid counting its storage more than once. When du would process many hard linked files -- as are created by some backup tools -- the amount of memory required for the supporting data structure could become prohibitively large. This patch takes advantage of the fact that the amount of information in the numbers of the typical dev,inode pair is far less than even 32 bits, and hence usually fits in the space of a pointer, be it 32 or 64 bits wide. A typical du traversal examines files on no more than a handful of distinct devices, so the device number can be encoded in just a few bits. Similarly, few inode numbers use all of the high bits in an ino_t. Before, we would represent the dev,inode pair using a naive struct, and allocate space for each. Thus, an entry in the hash table consisted of a pointer (to that struct) and a "next" pointer. With this change, we encode the dev,inode information and put those bits in place of the pointer, and thus do away with the need to allocate additional space for each dev,inode pair. * src/du.c: Include "di-set.h". Don't include "hash.h"; it's no longer used. (INITIAL_DI_SET_SIZE): Define. (di_set): New global, to replace "htab". (entry_hash, entry_compare, hash_init): Remove functions. (hash_ins): Use di-set functions, rather than ones from the hash module. (main): Likewise. * bootstrap.conf (gnulib_modules): Add the new di-set module. * NEWS (New features): Mention it. --- NEWS | 4 +++ bootstrap.conf | 1 + src/du.c | 73 +++++++------------------------------------------------ 3 files changed, 15 insertions(+), 63 deletions(-) diff --git a/NEWS b/NEWS index 3e170c5..e22969b 100644 --- a/NEWS +++ b/NEWS @@ -7,6 +7,10 @@ GNU coreutils NEWS -*- outline -*- du recognizes -d N as equivalent to --max-depth=N, for compatibility with FreeBSD. + du now uses less than half as much memory when operating on trees + with many hard-linked files. With --count-links (-l), or when + operating on trees with no hard-linked files, there is no change. + sort now accepts the --debug option, to highlight the part of the line significant in the sort, and warn about questionable options. diff --git a/bootstrap.conf b/bootstrap.conf index 6e85c9a..644c18b 100644 --- a/bootstrap.conf +++ b/bootstrap.conf @@ -69,6 +69,7 @@ gnulib_modules=" cycle-check d-ino d-type + di-set diacrit dirfd dirname diff --git a/src/du.c b/src/du.c index 2704f0f..5838085 100644 --- a/src/du.c +++ b/src/du.c @@ -30,10 +30,10 @@ #include "system.h" #include "argmatch.h" #include "argv-iter.h" +#include "di-set.h" #include "error.h" #include "exclude.h" #include "fprintftime.h" -#include "hash.h" #include "human.h" #include "quote.h" #include "quotearg.h" @@ -61,18 +61,10 @@ extern bool fts_debug; #endif /* Initial size of the hash table. */ -#define INITIAL_TABLE_SIZE 103 - -/* Hash structure for inode and device numbers. The separate entry - structure makes it easier to rehash "in place". */ -struct entry -{ - ino_t st_ino; - dev_t st_dev; -}; +enum { INITIAL_DI_SET_SIZE = 103 }; /* A set of dev/ino pairs. */ -static Hash_table *htab; +static struct di_set_state di_set; /* Define a class for collecting directory information. */ @@ -334,66 +326,20 @@ Mandatory arguments to long options are mandatory for short options too.\n\ exit (status); } -static size_t -entry_hash (void const *x, size_t table_size) -{ - struct entry const *p = x; - - /* Ignoring the device number here should be fine. */ - /* The cast to uintmax_t prevents negative remainders - if st_ino is negative. */ - return (uintmax_t) p->st_ino % table_size; -} - -/* Compare two dev/ino pairs. Return true if they are the same. */ -static bool -entry_compare (void const *x, void const *y) -{ - struct entry const *a = x; - struct entry const *b = y; - return SAME_INODE (*a, *b) ? true : false; -} - /* Try to insert the INO/DEV pair into the global table, HTAB. Return true if the pair is successfully inserted, false if the pair is already in the table. */ static bool hash_ins (ino_t ino, dev_t dev) { - struct entry *ent; - struct entry *ent_from_table; - - ent = xmalloc (sizeof *ent); - ent->st_ino = ino; - ent->st_dev = dev; - - ent_from_table = hash_insert (htab, ent); - if (ent_from_table == NULL) + int inserted = di_set_insert (&di_set, dev, ino); + if (inserted < 0) { /* Insertion failed due to lack of memory. */ xalloc_die (); } - if (ent_from_table == ent) - { - /* Insertion succeeded. */ - return true; - } - - /* That pair is already in the table, so ENT was not inserted. Free it. */ - free (ent); - - return false; -} - -/* Initialize the hash table. */ -static void -hash_init (void) -{ - htab = hash_initialize (INITIAL_TABLE_SIZE, NULL, - entry_hash, entry_compare, free); - if (htab == NULL) - xalloc_die (); + return inserted ? true : false; } /* FIXME: this code is nearly identical to code in date.c */ @@ -945,8 +891,9 @@ main (int argc, char **argv) if (!ai) xalloc_die (); - /* Initialize the hash structure for inode numbers. */ - hash_init (); + /* Initialize the set of dev,inode pairs. */ + if (di_set_init (&di_set, INITIAL_DI_SET_SIZE)) + xalloc_die (); bit_flags |= symlink_deref_bits; static char *temp_argv[] = { NULL, NULL }; @@ -1024,7 +971,7 @@ main (int argc, char **argv) if (print_grand_total) print_size (&tot_dui, _("total")); - hash_free (htab); + di_set_free (&di_set); exit (ok ? EXIT_SUCCESS : EXIT_FAILURE); } -- 1.7.1.764.g84d391 From 83e06bbe9764aacb99f946cd1a2cda65340eaf14 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Fri, 4 Jun 2010 15:30:47 +0200 Subject: [PATCH 5/7] du: increase the initial dev-inode set size * src/du.c (INITIAL_DI_SET_SIZE): Increase to the prime just under 1024. This gives a speed-up of about 2% when processing a tree containing 100,000 files, each with a link count greater than 1, all pointing to files in some other tree. Show how/when this change is useful: $ z-mktree --br=300 --root=/tmp/z --depth=2 $ cp -lr /tmp/z /tmp/.j $ ./mem-usage-compare src/du -a /tmp/z --------------------------------------------------------------------- Command: ./src/du.old -a /t/z Massif arguments: --massif-out-file=m.old ms_print arguments: m.old --------------------------------------------------------------------- MB 5.618^... # | ... # | ... # | ... # | ... # | ... # :: | ... @ #::::::: | ... @ #:::::::@ | ... @ ::#:::::::@ | ... @ :@:::::::::#:::::::@ | ... @ @:::::::@:::::::::#:::::::@: | ... @ @::: :::@:::::::::#:::::::@: | ... @ ::@:::@::: :::@:::::::::#:::::::@: | ... @::::::::@:::@::: :::@:::::::::#:::::::@: | ... :@@::::: ::@:::@::: :::@:::::::::#:::::::@: | ... :@:::::@@::::: ::@:::@::: :::@:::::::::#:::::::@: | ... :::::::@::: :@@::::: ::@:::@::: :::@:::::::::#:::::::@: | ...:::: ::: :@::: :@@::::: ::@:::@::: :::@:::::::::#:::::::@: | ...:: : ::: :@::: :@@::::: ::@:::@::: :::@:::::::::#:::::::@: | :...:: : ::: :@::: :@@::::: ::@:::@::: :::@:::::::::#:::::::@: 0 +---...--------------------------------------------------------->Mi 0 ... 188.4 Number of snapshots: 68 Detailed snapshots: [1, 3, 11, 19, 24, 25, 33, 37, 44, 55 (peak), 65] --------------------------------------------------------------------- Command: ./src/du.new -a /t/z Massif arguments: --massif-out-file=m.new ms_print arguments: m.new --------------------------------------------------------------------- MB 3.090^... # | ... # | ... # | ... # | ... # | ... # | ... @ # | ... @ # | ... @ #:: :: :: | ... @ #:::::@::::@::::@:: | ... @ @ #:::::@::::@::::@:: | ... @ @ #:::::@::::@::::@:: | ... @ @:::::::::::@::#:::::@::::@::::@:: | ... @ @ @:::::::::::@::#:::::@::::@::::@:: | ... @ @:::::::::@@:::::::::::@::#:::::@::::@::::@:: | ... @ @::::: :::@@:::::::::::@::#:::::@::::@::::@:: | ... @:@::@@:@::::: :::@@:::::::::::@::#:::::@::::@::::@:: | ...::@:@:@: @ :@::::: :::@@:::::::::::@::#:::::@::::@::::@:: | ::...::@:@:@: @ :@::::: :::@@:::::::::::@::#:::::@::::@::::@:: | ::...::@:@:@: @ :@::::: :::@@:::::::::::@::#:::::@::::@::::@:: 0 +---...--------------------------------------------------------->Mi 0 ... 184.8 Number of snapshots: 96 Detailed snapshots: [16, 18, 20, 24, 26, 38, 39, 56, 61 (peak), 71, 81, 91] du: INITIAL_DI_SET_SIZE = 1021 --- src/du.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/src/du.c b/src/du.c index 5838085..d88a4f3 100644 --- a/src/du.c +++ b/src/du.c @@ -61,7 +61,7 @@ extern bool fts_debug; #endif /* Initial size of the hash table. */ -enum { INITIAL_DI_SET_SIZE = 103 }; +enum { INITIAL_DI_SET_SIZE = 1021 }; /* A set of dev/ino pairs. */ static struct di_set_state di_set; -- 1.7.1.764.g84d391 From 9fdda7b3dfccbac42b1edd91124e2346322d445a Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Sat, 13 Dec 2008 16:33:15 +0100 Subject: [PATCH 6/7] mem-usage-compare: new script --- mem-usage-compare | 18 ++++++++++++++++++ 1 files changed, 18 insertions(+), 0 deletions(-) create mode 100755 mem-usage-compare diff --git a/mem-usage-compare b/mem-usage-compare new file mode 100755 index 0000000..c7d31e7 --- /dev/null +++ b/mem-usage-compare @@ -0,0 +1,18 @@ +#!/bin/sh +prog=$1 +shift +args=$@ + +old=./$prog.old +new=./$prog.new + +$old --version || exit 1 +$new --version || exit 1 + +valgrind --tool=massif --massif-out-file=m.old -- $old $args > _m.out.old +valgrind --tool=massif --massif-out-file=m.new -- $new $args > _m.out.new + +diff -u _m.out.old _m.out.new || exit 1 + +ms_print m.old | head -40 +ms_print m.new | head -40 -- 1.7.1.764.g84d391 From 267f06ab788634a3a6a9bf2b1f467f9e974fd292 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Sat, 12 Jun 2010 19:08:38 +0200 Subject: [PATCH 7/7] FIXME: show how new du uses less than half the memory --- README-di-set | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 82 insertions(+), 0 deletions(-) create mode 100644 README-di-set diff --git a/README-di-set b/README-di-set new file mode 100644 index 0000000..bcf46a0 --- /dev/null +++ b/README-di-set @@ -0,0 +1,82 @@ +On an x86_64 system: + +-------------------------------------------------------------------------------- +Command: ./src/du.old -a /t/z +Massif arguments: --massif-out-file=m.old +ms_print arguments: m.old +-------------------------------------------------------------------------------- + + + MB +5.618^ # + | # + | # + | # + | # + | # ::: + | @ #:::::::: + | @ #:::::::: + | @ ::#:::::::: + | @ :@:::::@::#:::::::@ + | @ @:::@::::@:::::@::#:::::::@ + | @ @:::@::::@:::::@::#:::::::@ + | @ ::::@::::@:::@::::@:::::@::#:::::::@ + | @@ @:::: ::@::::@:::@::::@:::::@::#:::::::@ + | @ :::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @ @ :::::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @::::::@ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | ::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | @ :@:::::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + | ::@@:::@:: ::@:: :: @ :: ::::@: :: ::@::::@:::@::::@:::::@::#:::::::@ + 0 +----------------------------------------------------------------------->Mi + 0 189.6 + +Number of snapshots: 69 + Detailed snapshots: [4, 5, 9, 14, 19, 26, 32, 37, 42, 47, 54, 57 (peak), 67] + +-------------------------------------------------------------------------------- + n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) +-------------------------------------------------------------------------------- + 0 0 0 0 0 0 + 1 2,515,906 185,176 170,412 14,764 0 + 2 5,834,589 258,000 231,344 26,656 0 +-------------------------------------------------------------------------------- +Command: ./src/du.new -a /t/z +Massif arguments: --massif-out-file=m.new +ms_print arguments: m.new +-------------------------------------------------------------------------------- + + + MB +3.145^ # + | # + | # + | # + | # + | # + | @@ # + | @ # + | @ # :: + | @ #:::::@::::@:::::@ + | @ @ #:::::@::::@:::::@ + | @ @ #:::::@::::@:::::@ + | @ @ ::::::::@:::::#:::::@::::@:::::@ + | @@ @ @ ::::: ::@:::::#:::::@::::@:::::@ + | @ @::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ @::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | :::@@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | @ :::@:: @@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + | :@::@@@:::@:: @@@ ::::::@::::::::::@ ::::: ::@:::::#:::::@::::@:::::@ + 0 +----------------------------------------------------------------------->Mi + 0 188.8 + +Number of snapshots: 93 + Detailed snapshots: [4, 7, 8, 9, 14, 18, 19, 21, 29, 40, 51, 60 (peak), 70, 80, 90] + +-------------------------------------------------------------------------------- + n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) +-------------------------------------------------------------------------------- + 0 0 0 0 0 0 + 1 3,286,259 134,120 129,254 4,866 0 + 2 5,525,860 226,432 218,599 7,833 0 -- 1.7.1.764.g84d391
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 28 Jun 2010 01:18:03 GMT) Full text and rfc822 format available.Message #8 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Pádraig Brady <P <at> draigBrady.com> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Mon, 28 Jun 2010 02:15:40 +0100
On 27/06/10 23:09, Jim Meyering wrote: > FYI, > This is just a preliminary heads-up. At least one detail must be > addressed before I push: I haven't written justification for the > change to hash.c (I confess that I don't remember -- parts of this > change date back to Dec 2008 ;-) I tried to test the speed of this compared to the original, but got a consistent abort() after doing: cd /other/dev/gcc cp -al gcc-4.5.0 gcc-4.5.0.hl cd ~/git/coreutils ./src/du -hs /other/dev/gcc Setting breakpoint at hash.c:1077 is just before it calls abort() hash_find_entry (table=0x8060ac8, entry=0x810201dd, bucket_head=0xbfffef9c, delete=false) at hash.c:788 di_ent_compare (x=0x810201dd, y=0x81fc7a95) == true shout if you need anything else. cheers, Pádraig.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 28 Jun 2010 08:06:01 GMT) Full text and rfc822 format available.Message #11 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Pádraig Brady <P <at> draigBrady.com> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Mon, 28 Jun 2010 10:05:44 +0200
Pádraig Brady wrote: > On 27/06/10 23:09, Jim Meyering wrote: >> FYI, >> This is just a preliminary heads-up. At least one detail must be >> addressed before I push: I haven't written justification for the >> change to hash.c (I confess that I don't remember -- parts of this >> change date back to Dec 2008 ;-) > > I tried to test the speed of this compared to the original, > but got a consistent abort() after doing: > > cd /other/dev/gcc > cp -al gcc-4.5.0 gcc-4.5.0.hl > cd ~/git/coreutils > ./src/du -hs /other/dev/gcc > > Setting breakpoint at hash.c:1077 is just before it calls abort() > hash_find_entry (table=0x8060ac8, entry=0x810201dd, bucket_head=0xbfffef9c, delete=false) at hash.c:788 > di_ent_compare (x=0x810201dd, y=0x81fc7a95) == true > > shout if you need anything else. Thank you for the quick testing! So far I've been unable to spot the cause or to reproduce that, but since the way the new code works depends on the range of inode number values, that's not too surprising. You have some very large inode numbers in that hierarchy, and that's exercising code that is not run in my tests. I ran this in a directory with a few hundred cloned/checked-out source directories: for i in *; do echo $i; cp -al $i .k; /cu/src/du -hs $i || break; find .k -type d -exec chmod 700 '{}' + ; rm -rf .k;done and du never aborted. (that was on an x86_64/F13 system) What type of file system are you using? How big is it? Fedora 13? i686 or x86_64? It may help if you can print the "a" and "b" structs that are compared in di_ent_compare. If I haven't made a mistake in e.g., di_ent_hash, di_ent_compare, decode_ptr, etc., then the trouble may be with hash.c itself, via its rehashing algorithm. The offending abort is essentially making this assertion: We did a lookup for "entry" before the rehash, and it wasn't there. Now, after the rehash, determine which bucket "entry" hashes to, and by the way, if we find a matching entry in the table, then abort. BTW, how did you build? I used ./configure --enable-gcc-warnings --disable-nls --prefix=/p For good measure I repeated the test above on an i686/F13 system. Still no failure. What is in your /other/dev/gcc/gcc-4.5.0 directory, other than an unpacked official 4.5.0 tarball? I tried to reproduce like this, on both tmpfs and ext4, and on i686 and x86_64: mkdir gcc && cd gcc tar xf ../gcc-4.5.0.tar.bz2 cp -al gcc-4.5.0 gcc-4.5.0.hl ...path-to-just-built/src/du -sh $PWD It always succeeds like this: 586M /home/meyering/gcc
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 28 Jun 2010 11:11:01 GMT) Full text and rfc822 format available.Message #14 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Pádraig Brady <P <at> draigBrady.com> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Mon, 28 Jun 2010 12:08:33 +0100
On 28/06/10 09:05, Jim Meyering wrote: > Thank you for the quick testing! > So far I've been unable to spot the cause or to reproduce that, but since > the way the new code works depends on the range of inode number values, > that's not too surprising. You have some very large inode numbers in > that hierarchy, and that's exercising code that is not run in my tests. > > It always succeeds like this: > > 586M /home/meyering/gcc What I did exactly was: ./bootstrap --gnulib-srcdir=/home/padraig/git/gnulib; ./configure --enable-gcc-warnings; make However now having worked on another branch and switched back and repeated the above, I can't reproduce :( I'd chalk this down to build issues though I can't see what's changed. $ ./src/du -hs ~/f8/gcc 2.9G /home/padraig/f8/gcc $ find ~/f8/gcc -printf "%D:%i\n" | sort -t: -k2,2n | tail -n10 2056:9765506 2056:9765506 2056:9765507 2056:9765507 2056:9765511 2056:9765511 2056:9765512 2056:9765512 2056:9806497 2056:9806497 $ df -T ~/f8/gcc Filesystem Type 1K-blocks Used Available Use% Mounted on /dev/sda8 ext3 38593360 34356772 3439768 91% /old_home There is no significant difference in speed between the old new and your new memory efficient one. cheers, Pádraig.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 28 Jun 2010 11:58:02 GMT) Full text and rfc822 format available.Message #17 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Pádraig Brady <P <at> draigBrady.com> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Mon, 28 Jun 2010 13:56:50 +0200
Pádraig Brady wrote: > On 28/06/10 09:05, Jim Meyering wrote: >> Thank you for the quick testing! >> So far I've been unable to spot the cause or to reproduce that, but since >> the way the new code works depends on the range of inode number values, >> that's not too surprising. You have some very large inode numbers in >> that hierarchy, and that's exercising code that is not run in my tests. >> >> It always succeeds like this: >> >> 586M /home/meyering/gcc > > What I did exactly was: > ./bootstrap --gnulib-srcdir=/home/padraig/git/gnulib; ./configure --enable-gcc-warnings; make > > However now having worked on another branch and switched back > and repeated the above, I can't reproduce :( Hmm... mixed feelings. Glad there appears to be no problem, yet concerned that a problem may be lurking... > I'd chalk this down to build issues though I can't see what's changed. > > $ ./src/du -hs ~/f8/gcc > 2.9G /home/padraig/f8/gcc > > $ find ~/f8/gcc -printf "%D:%i\n" | sort -t: -k2,2n | tail -n10 > 2056:9765506 > 2056:9765506 > 2056:9765507 > 2056:9765507 > 2056:9765511 > 2056:9765511 > 2056:9765512 > 2056:9765512 > 2056:9806497 > 2056:9806497 > > $ df -T ~/f8/gcc > Filesystem Type 1K-blocks Used Available Use% Mounted on > /dev/sda8 ext3 38593360 34356772 3439768 91% /old_home > > There is no significant difference in speed between > the old new and your new memory efficient one. Thanks again for testing.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Thu, 01 Jul 2010 22:40:04 GMT) Full text and rfc822 format available.Message #20 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Thu, 01 Jul 2010 15:39:32 -0700
Re the patch you just reported in: <http://lists.gnu.org/archive/html/bug-gnulib/2010-07/msg00005.html> I assume this is to support the du performance improvement that was proposed in <http://debbugs.gnu.org/db/65/6524.html>. I see that you incorporated a further improvement in the gnulib patch, namely, support for NULL keys. The gnulib change seems fine, but I noticed some problems with the coreutils part of that earlier proposal. Among other things, it makes du dump core on a (large) test case that I have locally. I don't know why (perhaps there were further fixes to the gnulib part that I didn't get right?). I found out about the core dump only because I had independently prepared a patch that I think is better: it uses a bit less memory and is quite a bit simpler. I'll try to merge my patch with your gnulib change and send it off to bug-coreutils in the next day or two. My patch, by the way, doesn't hash pointer keys: it just uses size_t values and casts them to void * (which is what the hash package wants). This trick works on all architectures that I know about, but it isn't guaranteed by C or POSIX and the casts are a bit offputting, so it'd be nice if the hash package supported hashing size_t keys directly. That's lower priority, though.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Thu, 01 Jul 2010 23:00:04 GMT) Full text and rfc822 format available.Message #23 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Fri, 02 Jul 2010 00:59:09 +0200
Paul Eggert wrote: > Re the patch you just reported in: > <http://lists.gnu.org/archive/html/bug-gnulib/2010-07/msg00005.html> > > I assume this is to support the du performance improvement that was proposed in > <http://debbugs.gnu.org/db/65/6524.html>. Right. That's the same as this: http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/20857 > I see that you incorporated a further > improvement in the gnulib patch, namely, support for NULL keys. > > The gnulib change seems fine, but I noticed some problems with the coreutils > part of that earlier proposal. Among other things, it makes du dump core on a > (large) test case that I have locally. I don't know why (perhaps there > were further fixes to the gnulib part that I didn't get right?). I found Thanks for testing those. Can you give me a backtrace? Even if your patch is clearly better, I'd like to know how my code is malfunctioning. > out about the core dump only because I had independently prepared a patch > that I think is better: it uses a bit less memory and is quite a bit simpler. > I'll try to merge my patch with your gnulib change and send it off to > bug-coreutils in the next day or two. > > My patch, by the way, doesn't hash pointer keys: it just uses size_t values > and casts them to void * (which is what the hash package wants). This trick > works on all architectures that I know about, but it isn't guaranteed by > C or POSIX and the casts are a bit offputting, Using unions is one way to avoid casts (and certain warnings). > so it'd be nice if the hash > package supported hashing size_t keys directly. That's lower priority, though.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 00:08:02 GMT) Full text and rfc822 format available.Message #26 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Thu, 01 Jul 2010 17:07:00 -0700
On 07/01/10 15:59, Jim Meyering wrote: > Can you give me a backtrace? > Even if your patch is clearly better, I'd like to know > how my code is malfunctioning. Sure. This uses your older patch, not the newer gnulib one. Program received signal SIGABRT, Aborted. 0x0012d422 in __kernel_vsyscall () (gdb) where #0 0x0012d422 in __kernel_vsyscall () #1 0x00158651 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #2 0x0015ba82 in *__GI_abort () at abort.c:92 #3 0x08054826 in hash_insert0 (table=0x805f208, entry=0x8017384d, matched_ent=0x0) at hash.c:1078 #4 0x0804bdba in di_set_insert (dis=0x805e2f4, dev=2049, ino=380435) at di-set.c:264 #5 0x0804ab0f in hash_ins (argc=52, argv=0xbffff304) at du.c:335 #6 process_file (argc=52, argv=0xbffff304) at du.c:467 #7 du_files (argc=52, argv=0xbffff304) at du.c:592 #8 main (argc=52, argv=0xbffff304) at du.c:962 The test case, by the way, is "./du ../../cu*", where the globbing pattern expands to 51 directories, each a copy of coreutils, with several hard links because many are git clones of each other. Arf arf! (I'm eating my own dog food.) > Using unions is one way to avoid casts (and certain warnings). Yes; but ouch! That's jumping into the fire! I suspect it's even less likely to work on weird architectures. I'd rather stay in the frying pan.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 03:28:02 GMT) Full text and rfc822 format available.Message #29 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Eric Blake <eblake <at> redhat.com> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org, Bruno Haible <bruno <at> clisp.org>, José Marchesi <jemarch <at> gnu.org> Subject: Re: RFC: modules for generic unordered sets and mappings Date: Thu, 01 Jul 2010 21:26:38 -0600
[Message part 1 (text/plain, inline)]
On 07/01/2010 06:44 PM, Paul Eggert wrote: > I've just gone through "du", and it's fresh in my > mind, so I thought I'd run through what I needed there, to see how > well it maps to the proposal. > > I needed three kinds of hash tables. > > * Table A is indexed by device number (a dev_t value) and the value is a > secondary hash table. Typically the number of devices is relatively > small. > > * Table B is indexed by inode number (an ino_t value) and has no value. > All that matters is whether the key is present. So it is a set, > not a general mapping. Often there are many, many inode numbers. > > * The gnulib hash package uses void * keys, but having a void * value > that exists only to point to an ino_t value wastes memory. So, if > an inode number is small (less than M, say), it maps to itself. If > it is large, it is a key into a third table (table C) whose values > are mapped inode numbers. (Values in table C are always M or > greater.) The mapped inode number (whether taken from table C, or > used directly) is cast to void * and then used as the index into > table B. In practice, most inode numbers are less than M so this is > a storage win. We need to be careful on cygwin. $ ls -i | head 1125899907178954 ABOUT-NLS 28147497671067760 AUTHORS 1970324837308495 COPYING 9851624184873962 ChangeLog 4785074604197847 ChangeLog-2005 1970324837180710 ChangeLog-2006 1970324837078885 ChangeLog-2007 2251799813904421 ChangeLog-2008 281474977732635 GNUmakefile 1125899907781258 GNUmakefile~ sizeof(void*) == sizeof(size_t) == 4, but sizeof(ino_t) == 8, and most inodes are quite randomly dispersed but definitely larger than 4 bytes. Does your scheme work well at mapping cygwin's 8-byte inodes into 4-byte pointers? -- Eric Blake eblake <at> redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 06:14:02 GMT) Full text and rfc822 format available.Message #32 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Eric Blake <eblake <at> redhat.com> Cc: Bruno Haible <bruno <at> clisp.org>, José Marchesi <jemarch <at> gnu.org>, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: RFC: modules for generic unordered sets and mappings Date: Thu, 01 Jul 2010 23:13:11 -0700
On 07/01/10 20:26, Eric Blake wrote: > sizeof(void*) == sizeof(size_t) == 4, but sizeof(ino_t) == 8, and most > inodes are quite randomly dispersed but definitely larger than 4 bytes. > Does your scheme work well at mapping cygwin's 8-byte inodes into > 4-byte pointers? For those inode numbers the scheme uses table C, i.e., it hashes them into smaller integers. This won't work as well as file systems with "small" inode numbers (small, on 32-bit machines, means inode numbers less than 2**31), but it should still work better than du does now.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 08:22:01 GMT) Full text and rfc822 format available.Message #35 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Eric Blake <eblake <at> redhat.com> Cc: Paul Eggert <eggert <at> CS.UCLA.EDU>, bug-gnulib <at> gnu.org, Jos, 6524 <at> debbugs.gnu.org, Bruno Haible <bruno <at> clisp.org>, é Marchesi <jemarch <at> gnu.org> Subject: Re: RFC: modules for generic unordered sets and mappings Date: Fri, 02 Jul 2010 10:21:25 +0200
Eric Blake wrote: > On 07/01/2010 06:44 PM, Paul Eggert wrote: >> I've just gone through "du", and it's fresh in my >> mind, so I thought I'd run through what I needed there, to see how >> well it maps to the proposal. >> >> I needed three kinds of hash tables. >> >> * Table A is indexed by device number (a dev_t value) and the value is a >> secondary hash table. Typically the number of devices is relatively >> small. >> >> * Table B is indexed by inode number (an ino_t value) and has no value. >> All that matters is whether the key is present. So it is a set, >> not a general mapping. Often there are many, many inode numbers. >> >> * The gnulib hash package uses void * keys, but having a void * value >> that exists only to point to an ino_t value wastes memory. So, if >> an inode number is small (less than M, say), it maps to itself. If >> it is large, it is a key into a third table (table C) whose values >> are mapped inode numbers. (Values in table C are always M or >> greater.) The mapped inode number (whether taken from table C, or >> used directly) is cast to void * and then used as the index into >> table B. In practice, most inode numbers are less than M so this is >> a storage win. > > We need to be careful on cygwin. > > $ ls -i | head > 1125899907178954 ABOUT-NLS > 28147497671067760 AUTHORS > 1970324837308495 COPYING > 9851624184873962 ChangeLog > 4785074604197847 ChangeLog-2005 > 1970324837180710 ChangeLog-2006 > 1970324837078885 ChangeLog-2007 > 2251799813904421 ChangeLog-2008 > 281474977732635 GNUmakefile > 1125899907781258 GNUmakefile~ > > sizeof(void*) == sizeof(size_t) == 4, but sizeof(ino_t) == 8, and most > inodes are quite randomly dispersed but definitely larger than 4 bytes. > Does your scheme work well at mapping cygwin's 8-byte inodes into > 4-byte pointers? It's better not to modify algorithms in order to optimize for systems with 4-byte pointers. We can find/use a set/hash package that can efficiently handle keys of type uintmax_t regardless of pointer size. Combine that with a slight re-design suggested by Paul and (before him, privately) by Tim Waugh -- to map device numbers to inode hash tables rather than to small integers that need to be encoded -- and you get all the benefits with much less complexity.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 11:00:04 GMT) Full text and rfc822 format available.Message #38 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: bug-gnulib <at> gnu.org, bug-coreutils <at> gnu.org, Pádraig Brady <P <at> draigBrady.com>, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Fri, 02 Jul 2010 12:58:50 +0200
Paul Eggert wrote: > On 07/01/10 15:59, Jim Meyering wrote: > >> Can you give me a backtrace? >> Even if your patch is clearly better, I'd like to know >> how my code is malfunctioning. > > Sure. This uses your older patch, not the newer gnulib one. I think they're functionally equivalent. The new one adds comments. > Program received signal SIGABRT, Aborted. > 0x0012d422 in __kernel_vsyscall () > (gdb) where > #0 0x0012d422 in __kernel_vsyscall () > #1 0x00158651 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 > #2 0x0015ba82 in *__GI_abort () at abort.c:92 > #3 0x08054826 in hash_insert0 (table=0x805f208, entry=0x8017384d, matched_ent=0x0) at hash.c:1078 > #4 0x0804bdba in di_set_insert (dis=0x805e2f4, dev=2049, ino=380435) at di-set.c:264 > #5 0x0804ab0f in hash_ins (argc=52, argv=0xbffff304) at du.c:335 > #6 process_file (argc=52, argv=0xbffff304) at du.c:467 > #7 du_files (argc=52, argv=0xbffff304) at du.c:592 > #8 main (argc=52, argv=0xbffff304) at du.c:962 > > The test case, by the way, is "./du ../../cu*", where the > globbing pattern expands to 51 directories, each a copy of coreutils, > with several hard links because many are git clones of each other. Ouch. That's the same assertion that Pádraig hit. But he was subsequently unable to reproduce it. It indicates that when attempting to insert the 2049,380435 dev,inode pair, hash_insert0 found that the pair was not yet in the table, and that the table was too full, so it first rehashed everything into a larger table, and then asserted this: Since ENTRY was not in the initial table, it had better not be in the rehashed table. That assertion failed, because somehow, ENTRY *was* in the new table. Knowing the initial table size and the new one, given the above, we might be able to determine what went wrong. I have tried numerous times on both x86_64 and i686 to reproduce it, but have never managed to trigger the failure. Can you run "print *table" in gdb, say in hash_insert0? Better still, set a conditional breakpoint in hash_insert0 to trigger when ENTRY == 0x8017384d (encoded version of 2049,380435), and print the entire table using a variant of hash_print, step past the rehash, and print the new table.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 19:18:02 GMT) Full text and rfc822 format available.Message #41 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Fri, 02 Jul 2010 21:17:20 +0200
Paul Eggert wrote: > On 07/01/10 15:59, Jim Meyering wrote: > >> Can you give me a backtrace? >> Even if your patch is clearly better, I'd like to know >> how my code is malfunctioning. > > Sure. This uses your older patch, not the newer gnulib one. > > Program received signal SIGABRT, Aborted. > 0x0012d422 in __kernel_vsyscall () > (gdb) where > #0 0x0012d422 in __kernel_vsyscall () > #1 0x00158651 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 > #2 0x0015ba82 in *__GI_abort () at abort.c:92 > #3 0x08054826 in hash_insert0 (table=0x805f208, entry=0x8017384d, matched_ent=0x0) at hash.c:1078 > #4 0x0804bdba in di_set_insert (dis=0x805e2f4, dev=2049, ino=380435) at di-set.c:264 > #5 0x0804ab0f in hash_ins (argc=52, argv=0xbffff304) at du.c:335 > #6 process_file (argc=52, argv=0xbffff304) at du.c:467 > #7 du_files (argc=52, argv=0xbffff304) at du.c:592 > #8 main (argc=52, argv=0xbffff304) at du.c:962 > > The test case, by the way, is "./du ../../cu*", where the > globbing pattern expands to 51 directories, each a copy of coreutils, > with several hard links because many are git clones of each other. > Arf arf! (I'm eating my own dog food.) Thanks again for the report. However, while I was able to reproduce it (on Paul's system) and debug it, it appears to be due to a miscompilation of di-set.o when using a private copy of gcc-4.5.0. When I recompiled that one file with gcc-Ubuntu 4.4.3-4ubuntu5 and -g -O2 or with 4.5.0 and -g -O, the code works as expected. The problem arose in the di_ent_compare function and its use of the tiny decode_ptr. For the record, here they are: static struct di_ent decode_ptr (struct di_ent const *v) { if (!is_encoded_ptr (v)) return *v; struct di_ent di; di.u.ptr = (void *) v; return di; } /* Compare two di_ent structs. Return true if they are the same. */ static bool di_ent_compare (void const *x, void const *y) { struct di_ent a = decode_ptr (x); struct di_ent b = decode_ptr (y); if (a.u.di4.mode != b.u.di4.mode) return false; if (a.u.di4.mode == DI_MODE_4) return (a.u.di4.short_ino == b.u.di4.short_ino && a.u.di4.mapped_dev == b.u.di4.mapped_dev); if (a.u.di8.mode == DI_MODE_8) return (a.u.di8.short_ino == b.u.di8.short_ino && a.u.di8.mapped_dev == b.u.di8.mapped_dev); return (a.u.full.ino == b.u.full.ino && a.u.full.dev == b.u.full.dev); } Even though they're obviously distinct encoded values (see the hexadecimal values below), di_ent_compare (aka table->comparator) reports they're equal when using gcc-4.5.0 -g -O2, and that causes the failed assertion: (gdb) p table->comparator (entry, bucket->data) $9 = true (gdb) p bucket->data $10 = (void *) 0x80143c4d (gdb) p entry $11 = (const void *) 0x80173851 I expect to push this series shortly. Paul, if you end up improving further, you're welcome to revert any pieces that are no longer used.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 22:09:02 GMT) Full text and rfc822 format available.Message #44 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: bug-coreutils <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Fri, 02 Jul 2010 15:07:54 -0700
(A circa-2008 patch! You're almost as backlogged as I am!) Anyway, here's the revised proposal for that, which I privately promised a day or two ago. It avoids using unions and bitfields, which are a bit hard to follow and apparently run into compiler bugs with GCC 4.5.0 on x86. In my measurements, this patch uses a bit less memory than the earlier patch, when the earlier patch worked for me; but this is highly test-dependent and it would help to try this out on other test cases. This patch continues to treat inode 0 specially, to work around a problem with hash.c. Recently hash.c was changed to no longer reject NULL keys, but its internals still depend on not allowing NULL keys so this patch is careful to never create a key by casting 0 to (void *). The assumption is that 0 is the only size_t value i such that ((void *) i == NULL); this assumption is true for all practical architectures that I know about, though it is not guaranteed by any standard. One other thing: the gnulib (or prototype gnulib) stuff I converted to C90; I assume that's still a constraint for gnulib (though not for coreutils proper). This was mostly just moving decls to the starts of blocks. Change 'du' to use much less memory to track hard links. * NEWS: Mention this. * bootstrap.conf (gnulib_modules): Add di-set. * gl/lib/di-set.c, gl/lib/di-set.h: New files. * gl/lib/ino-map.c, gl/lib/ino-map.h: New files. * gl/modules/di-set, gl/modules/di-set-tests: New files. * gl/modules/ino-map, gl/modules/ino-map-tests: New files. * gl/tests/test-di-set.c, gl/tests/test-ino-map.c: New files. * src/du.c: Include di-set.h rather than hash.h. (INITIAL_TABLE_SIZE, struct entry, htab, entry_hash, entry_compare): (hash_init): Remove. (di_set): New static var. (hash_ins): Just use di_set_insert. (main): Use di_set_alloc and di_set_free to allocate and free tables, rather than the old hash_init and hash_free. diff --git a/NEWS b/NEWS index 3a24925..2493ef8 100644 --- a/NEWS +++ b/NEWS @@ -10,6 +10,10 @@ GNU coreutils NEWS -*- outline -*- du recognizes -d N as equivalent to --max-depth=N, for compatibility with FreeBSD. + du now uses less than half as much memory when operating on trees + with many hard-linked files. With --count-links (-l), or when + operating on trees with no hard-linked files, there is no change. + sort now accepts the --debug option, to highlight the part of the line significant in the sort, and warn about questionable options. diff --git a/bootstrap.conf b/bootstrap.conf index 6e85c9a..644c18b 100644 --- a/bootstrap.conf +++ b/bootstrap.conf @@ -69,6 +69,7 @@ gnulib_modules=" cycle-check d-ino d-type + di-set diacrit dirfd dirname diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c new file mode 100644 index 0000000..76baf54 --- /dev/null +++ b/gl/lib/di-set.c @@ -0,0 +1,232 @@ +/* Set operations for device-inode pairs stored in a space-efficient manner. + + Copyright 2009-2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Paul Eggert and Jim Meyering */ + +#include <config.h> +#include "di-set.h" + +#include "hash.h" +#include "ino-map.h" + +#include <stdint.h> +#include <stdlib.h> + +/* The hash package hashes "void *", but this package wants to hash + integers. Use integers that are as large as possible, but no + larger than void *, so that they can be cast to void * and back + without losing information. */ +typedef size_t hashint; +#define HASHINT_MAX ((hashint) -1) + +/* Integers represent inode numbers. Integers in the range + 1..(LARGE_INO_MIN-1) represent inode numbers directly. (The hash + package does not work with null pointers, so inode 0 cannot be used + as a key.) To find the representations of other inode numbers, map + them through INO_MAP. */ +#define LARGE_INO_MIN (HASHINT_MAX / 2) + +/* Set operations for device-inode pairs stored in a space-efficient + manner. Use a two-level hash table. The top level hashes by + device number, as there are typically a small number of devices. + The lower level hashes by mapped inode numbers. In the typical + case where the inode number is positive and small, the inode number + maps to itself, masquerading as a void * value; otherwise, its + value is the result of hashing the inode value through INO_MAP. */ + +/* A pair that maps a device number to a set of inode numbers. */ +struct di_ent +{ + dev_t dev; + struct hash_table *ino_set; +}; + +/* A two-level hash table that manages and indexes these pairs. */ +struct di_set +{ + /* Map device numbers to sets of inode number representatives. */ + struct hash_table *dev_map; + + /* If nonnull, map large inode numbers to their small + representatives. If null, there are no large inode numbers in + this set. */ + struct ino_map *ino_map; + + /* Cache of the most recently allocated and otherwise-unused storage + for probing this table. */ + struct di_ent *probe; +}; + +/* Hash a directory set entry. */ +static size_t +di_ent_hash (void const *x, size_t table_size) +{ + struct di_ent const *p = x; + + /* Use ordinary % if it's known to return a nonnegative value. + Otherwise, fall back on uintmax_t %, which could be much slower. */ + if ((dev_t) -1 % (size_t) 2 == 1) + return p->dev % table_size; + else + { + uintmax_t dev = p->dev; + return dev % table_size; + } +} + +/* Return true if two directory set entries are the same. */ +static bool +di_ent_compare (void const *x, void const *y) +{ + struct di_ent const *a = x; + struct di_ent const *b = y; + return a->dev == b->dev; +} + +/* Free a directory set entry. */ +static void +di_ent_free (void *v) +{ + struct di_ent *a = v; + hash_free (a->ino_set); + free (a); +} + +/* Create a set of device-inode pairs. Return NULL on allocation failure. */ +struct di_set * +di_set_alloc (void) +{ + struct di_set *dis = malloc (sizeof *dis); + if (dis) + { + enum { INITIAL_DEV_MAP_SIZE = 11 }; + dis->dev_map = hash_initialize (INITIAL_DEV_MAP_SIZE, NULL, + di_ent_hash, di_ent_compare, + di_ent_free); + if (! dis->dev_map) + { + free (dis); + return NULL; + } + dis->ino_map = NULL; + dis->probe = NULL; + } + + return dis; +} + +/* Free a set of device-inode pairs. */ +void +di_set_free (struct di_set *dis) +{ + hash_free (dis->dev_map); + free (dis->ino_map); + free (dis->probe); + free (dis); +} + +/* Hash an encoded inode number I. */ +static size_t +di_ino_hash (void const *i, size_t table_size) +{ + return (hashint) i % table_size; +} + +/* Using the DIS table, map a device to a hash table that represents + a set of inode numbers. Return NULL on error. */ +static struct hash_table * +map_device (struct di_set *dis, dev_t dev) +{ + /* Find space for the probe, reusing the cache if available. */ + struct di_ent *ent; + struct di_ent *probe = dis->probe; + if (probe) + { + /* If repeating a recent query, return the cached result. */ + if (probe->dev == dev) + return probe->ino_set; + } + else + { + dis->probe = probe = malloc (sizeof *probe); + if (! probe) + return NULL; + } + + /* Probe for the device. */ + probe->dev = dev; + ent = hash_insert (dis->dev_map, probe); + if (! ent) + return NULL; + + if (ent == probe) + { + enum { INITIAL_INO_SET_SIZE = 1021 }; + + /* Prepare to allocate a new probe next time; this one is in use. */ + dis->probe = NULL; + + /* DEV is new; allocate an inode set for it. */ + probe->ino_set = hash_initialize (INITIAL_INO_SET_SIZE, NULL, + di_ino_hash, NULL, NULL); + } + else + probe->ino_set = ent->ino_set; + + return probe->ino_set; +} + +/* Using the DIS table, map an inode number to a mapped value. + Return INO_MAP_INSERT_FAILURE on error. */ +static hashint +map_inode_number (struct di_set *dis, ino_t ino) +{ + if (0 < ino && ino < LARGE_INO_MIN) + return ino; + + if (! dis->ino_map) + { + dis->ino_map = ino_map_alloc (LARGE_INO_MIN); + if (! dis->ino_map) + return INO_MAP_INSERT_FAILURE; + } + + return ino_map_insert (dis->ino_map, ino); +} + +/* Attempt to insert the DEV,INO pair into the set DIS. + If it matches a pair already in DIS, keep that pair and return 0. + Otherwise, if insertion is successful, return 1. + Upon any failure return -1. */ +int +di_set_insert (struct di_set *dis, dev_t dev, ino_t ino) +{ + hashint i; + + /* Map the device number to a set of inodes. */ + struct hash_table *ino_set = map_device (dis, dev); + if (! ino_set) + return -1; + + /* Map the inode number to a small representative I. */ + i = map_inode_number (dis, ino); + if (i == INO_MAP_INSERT_FAILURE) + return -1; + + /* Put I into the inode set. */ + return hash_insert0 (ino_set, (void *) i, NULL); +} diff --git a/gl/lib/di-set.h b/gl/lib/di-set.h new file mode 100644 index 0000000..d30a31e --- /dev/null +++ b/gl/lib/di-set.h @@ -0,0 +1,12 @@ +#include <sys/types.h> + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m) +#endif + +struct di_set *di_set_alloc (void); +int di_set_insert (struct di_set *, dev_t, ino_t) _ATTRIBUTE_NONNULL_ (1); +void di_set_free (struct di_set *) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c new file mode 100644 index 0000000..083b49d --- /dev/null +++ b/gl/lib/ino-map.c @@ -0,0 +1,159 @@ +/* Map an ino_t inode number to a small integer. + + Copyright 2009, 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Paul Eggert and Jim Meyering */ + +#include <config.h> +#include "ino-map.h" + +#include "hash.h" +#include "verify.h" + +#include <stdint.h> +#include <stdlib.h> + +/* A pair that maps an inode number to a mapped inode number; the + latter is a small unique ID for the former. */ +struct ino_map_ent +{ + ino_t ino; + size_t mapped_ino; +}; + +/* A table that manages and indexes these pairs. */ +struct ino_map +{ + /* A table of KEY,VAL pairs, where KEY is the raw ino_t value and + VAL is the small number that it maps to. */ + struct hash_table *map; + + /* The next mapped inode number to hand out. */ + size_t next_mapped_ino; + + /* Cache of the most recently allocated and otherwise-unused storage + for probing the table. */ + struct ino_map_ent *probe; +}; + +/* Hash an inode map entry. */ +static size_t +ino_hash (void const *x, size_t table_size) +{ + struct ino_map_ent const *p = x; + + /* Use ordinary % if it's known to return a nonnegative value. + Otherwise, fall back on uintmax_t %, which could be much slower. */ + if ((ino_t) -1 % (size_t) 2 == 1) + return p->ino % table_size; + else + { + uintmax_t ino = p->ino; + return ino % table_size; + } +} + +/* Return true if two inode map entries are the same. */ +static bool +ino_compare (void const *x, void const *y) +{ + struct ino_map_ent const *a = x; + struct ino_map_ent const *b = y; + return a->ino == b->ino; +} + +/* Allocate an inode map that will hand out integers starting with + NEXT_MAPPED_INO. Return NULL if memory is exhausted. */ +struct ino_map * +ino_map_alloc (size_t next_mapped_ino) +{ + struct ino_map *im = malloc (sizeof *im); + + if (im) + { + enum { INITIAL_INO_MAP_TABLE_SIZE = 1021 }; + im->map = hash_initialize (INITIAL_INO_MAP_TABLE_SIZE, NULL, + ino_hash, ino_compare, free); + if (! im->map) + { + free (im); + return NULL; + } + im->next_mapped_ino = next_mapped_ino; + im->probe = NULL; + } + + return im; +} + +/* Free an inode map. */ +void +ino_map_free (struct ino_map *map) +{ + hash_free (map->map); + free (map->probe); + free (map); +} + + +/* Insert into MAP the inode number INO if it's not there already, + and return its nonnegative mapped inode number. + If INO is already in MAP, return the existing mapped inode number. + Return INO_MAP_INSERT_FAILURE on memory or counter exhaustion. */ +size_t +ino_map_insert (struct ino_map *im, ino_t ino) +{ + struct ino_map_ent *ent; + + /* Find space for the probe, reusing the cache if available. */ + struct ino_map_ent *probe = im->probe; + if (probe) + { + /* If repeating a recent query, return the cached result. */ + if (probe->ino == ino) + return probe->mapped_ino; + } + else + { + im->probe = probe = malloc (sizeof *probe); + if (! probe) + return INO_MAP_INSERT_FAILURE; + } + + probe->ino = ino; + ent = hash_insert (im->map, probe); + if (! ent) + return INO_MAP_INSERT_FAILURE; + + if (ent == probe) + { + /* If adding 1 to map->next_mapped_ino would cause it to + overflow to zero, then it must equal INO_MAP_INSERT_FAILURE, + which is value that should be returned in that case. Verify + that this works. */ + verify (INO_MAP_INSERT_FAILURE + 1 == 0); + + /* Prepare to allocate a new probe next time; this one is in use. */ + im->probe = NULL; + + /* INO is new; allocate a mapped inode number for it. */ + probe->mapped_ino = im->next_mapped_ino++; + } + else + probe->mapped_ino = ent->mapped_ino; + + return probe->mapped_ino; +} diff --git a/gl/lib/ino-map.h b/gl/lib/ino-map.h new file mode 100644 index 0000000..661b78a --- /dev/null +++ b/gl/lib/ino-map.h @@ -0,0 +1,14 @@ +#include <sys/types.h> + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m) +#endif + +#define INO_MAP_INSERT_FAILURE ((size_t) -1) + +struct ino_map *ino_map_alloc (size_t); +void ino_map_free (struct ino_map *) _ATTRIBUTE_NONNULL_ (1); +size_t ino_map_insert (struct ino_map *, ino_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/modules/di-set b/gl/modules/di-set new file mode 100644 index 0000000..562db14 --- /dev/null +++ b/gl/modules/di-set @@ -0,0 +1,24 @@ +Description: +manipulate sets of device-inode pairs efficiently + +Files: +lib/di-set.c +lib/di-set.h + +Depends-on: +ino-map +hash + +configure.ac: + +Makefile.am: +lib_SOURCES += di-set.c di-set.h + +Include: +"di-set.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/di-set-tests b/gl/modules/di-set-tests new file mode 100644 index 0000000..d60f7fd --- /dev/null +++ b/gl/modules/di-set-tests @@ -0,0 +1,10 @@ +Files: +tests/test-di-set.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-di-set +check_PROGRAMS += test-di-set diff --git a/gl/modules/ino-map b/gl/modules/ino-map new file mode 100644 index 0000000..06ee51d --- /dev/null +++ b/gl/modules/ino-map @@ -0,0 +1,24 @@ +Description: +maintain a mapping of ino_t numbers to small integers + +Files: +lib/ino-map.c +lib/ino-map.h + +Depends-on: +hash +verify + +configure.ac: + +Makefile.am: +lib_SOURCES += ino-map.c ino-map.h + +Include: +"ino-map.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/ino-map-tests b/gl/modules/ino-map-tests new file mode 100644 index 0000000..fa10b4f --- /dev/null +++ b/gl/modules/ino-map-tests @@ -0,0 +1,10 @@ +Files: +tests/test-ino-map.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-ino-map +check_PROGRAMS += test-ino-map diff --git a/gl/tests/test-di-set.c b/gl/tests/test-di-set.c new file mode 100644 index 0000000..e5fb6cb --- /dev/null +++ b/gl/tests/test-di-set.c @@ -0,0 +1,64 @@ +/* Test the di-set module. + Copyright (C) 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* Written by Jim Meyering. */ + +#include <config.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> +#include <stdint.h> + +#define ASSERT(expr) \ + do \ + { \ + if (!(expr)) \ + { \ + fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ + fflush (stderr); \ + abort (); \ + } \ + } \ + while (0) + +#include "di-set.h" + +int +main (void) +{ + /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ + struct di_set *dis = di_set_alloc (); + ASSERT (dis); + + ASSERT (di_set_insert (dis, 2, 5) == 1); /* first insertion succeeds */ + ASSERT (di_set_insert (dis, 2, 5) == 0); /* duplicate fails */ + ASSERT (di_set_insert (dis, 3, 5) == 1); /* diff dev, duplicate inode is ok */ + ASSERT (di_set_insert (dis, 2, 8) == 1); /* same dev, different inode is ok */ + + /* very large (or negative) inode number */ + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 1); + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 0); /* dup */ + + unsigned int i; + for (i = 0; i < 3000; i++) + ASSERT (di_set_insert (dis, 9, i) == 1); + for (i = 0; i < 3000; i++) + ASSERT (di_set_insert (dis, 9, i) == 0); /* duplicate fails */ + + di_set_free (dis); + + return 0; +} diff --git a/gl/tests/test-ino-map.c b/gl/tests/test-ino-map.c new file mode 100644 index 0000000..2b44602 --- /dev/null +++ b/gl/tests/test-ino-map.c @@ -0,0 +1,63 @@ +/* Test the ino-map module. + Copyright (C) 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* Written by Jim Meyering. */ + +#include <config.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +/* FIXME: once/if in gnulib, use #include "macros.h" in place of this */ +#define ASSERT(expr) \ + do \ + { \ + if (!(expr)) \ + { \ + fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ + fflush (stderr); \ + abort (); \ + } \ + } \ + while (0) + +#include "ino-map.h" + +int +main () +{ + /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ + enum { INO_MAP_INIT = 123 }; + struct ino_map *ino_map = ino_map_alloc (INO_MAP_INIT); + ASSERT (ino_map != NULL); + + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + + int i; + for (i = 0; i < 100; i++) + { + ASSERT (ino_map_insert (ino_map, 10000 + i) == INO_MAP_INIT + 3 + i); + } + + ino_map_free (ino_map); + + return 0; +} diff --git a/src/du.c b/src/du.c index a90568e..bc24861 100644 --- a/src/du.c +++ b/src/du.c @@ -30,10 +30,10 @@ #include "system.h" #include "argmatch.h" #include "argv-iter.h" +#include "di-set.h" #include "error.h" #include "exclude.h" #include "fprintftime.h" -#include "hash.h" #include "human.h" #include "quote.h" #include "quotearg.h" @@ -60,19 +60,8 @@ extern bool fts_debug; # define FTS_CROSS_CHECK(Fts) #endif -/* Initial size of the hash table. */ -#define INITIAL_TABLE_SIZE 103 - -/* Hash structure for inode and device numbers. The separate entry - structure makes it easier to rehash "in place". */ -struct entry -{ - ino_t st_ino; - dev_t st_dev; -}; - /* A set of dev/ino pairs. */ -static Hash_table *htab; +static struct di_set *di_set; /* Define a class for collecting directory information. */ @@ -336,66 +325,16 @@ Mandatory arguments to long options are mandatory for short options too.\n\ exit (status); } -static size_t -entry_hash (void const *x, size_t table_size) -{ - struct entry const *p = x; - - /* Ignoring the device number here should be fine. */ - /* The cast to uintmax_t prevents negative remainders - if st_ino is negative. */ - return (uintmax_t) p->st_ino % table_size; -} - -/* Compare two dev/ino pairs. Return true if they are the same. */ -static bool -entry_compare (void const *x, void const *y) -{ - struct entry const *a = x; - struct entry const *b = y; - return SAME_INODE (*a, *b) ? true : false; -} - /* Try to insert the INO/DEV pair into the global table, HTAB. Return true if the pair is successfully inserted, false if the pair is already in the table. */ static bool hash_ins (ino_t ino, dev_t dev) { - struct entry *ent; - struct entry *ent_from_table; - - ent = xmalloc (sizeof *ent); - ent->st_ino = ino; - ent->st_dev = dev; - - ent_from_table = hash_insert (htab, ent); - if (ent_from_table == NULL) - { - /* Insertion failed due to lack of memory. */ - xalloc_die (); - } - - if (ent_from_table == ent) - { - /* Insertion succeeded. */ - return true; - } - - /* That pair is already in the table, so ENT was not inserted. Free it. */ - free (ent); - - return false; -} - -/* Initialize the hash table. */ -static void -hash_init (void) -{ - htab = hash_initialize (INITIAL_TABLE_SIZE, NULL, - entry_hash, entry_compare, free); - if (htab == NULL) + int inserted = di_set_insert (di_set, dev, ino); + if (inserted < 0) xalloc_die (); + return inserted; } /* FIXME: this code is nearly identical to code in date.c */ @@ -947,8 +886,10 @@ main (int argc, char **argv) if (!ai) xalloc_die (); - /* Initialize the hash structure for inode numbers. */ - hash_init (); + /* Initialize the set of dev,inode pairs. */ + di_set = di_set_alloc (); + if (!di_set) + xalloc_die (); bit_flags |= symlink_deref_bits; static char *temp_argv[] = { NULL, NULL }; @@ -1019,6 +960,7 @@ main (int argc, char **argv) } argv_iter_free (ai); + di_set_free (di_set); if (files_from && (ferror (stdin) || fclose (stdin) != 0)) error (EXIT_FAILURE, 0, _("error reading %s"), quote (files_from)); @@ -1026,7 +968,5 @@ main (int argc, char **argv) if (print_grand_total) print_size (&tot_dui, _("total")); - hash_free (htab); - exit (ok ? EXIT_SUCCESS : EXIT_FAILURE); }
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Fri, 02 Jul 2010 22:09:02 GMT) Full text and rfc822 format available.owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 04 Jul 2010 09:32:02 GMT) Full text and rfc822 format available.Message #50 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: bug-coreutils <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Sun, 04 Jul 2010 11:31:18 +0200
Paul Eggert wrote: > (A circa-2008 patch! You're almost as backlogged as I am!) > > Anyway, here's the revised proposal for that, which I privately > promised a day or two ago. It avoids using unions and bitfields, > which are a bit hard to follow and apparently run into compiler bugs > with GCC 4.5.0 on x86. In my measurements, this patch uses a bit less > memory than the earlier patch, when the earlier patch worked for me; > but this is highly test-dependent and it would help to try this out on > other test cases. > > This patch continues to treat inode 0 specially, to work around a > problem with hash.c. Recently hash.c was changed to no longer reject > NULL keys, but its internals still depend on not allowing NULL keys so Thanks. I noticed that and looked into fixing it, but for now have decided simply to revert the abort-removal bit. I've just pushed this change: http://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=fbc47512f3f8e153 > this patch is careful to never create a key by casting 0 to (void *). > The assumption is that 0 is the only size_t value i such that ((void > *) i == NULL); this assumption is true for all practical architectures > that I know about, though it is not guaranteed by any standard. > > One other thing: the gnulib (or prototype gnulib) stuff I converted > to C90; I assume that's still a constraint for gnulib (though not > for coreutils proper). This was mostly just moving decls to the > starts of blocks. Thanks! Here are minor suggestions on wording and style. I find that putting the single-stmt (currently "else") block after a nontrivial braced "then" block is slightly less readable. So I reversed the condition, block ordering, and added comments. You're welcome to squash this into yours. I've already done that, locally, then teased apart our changes so your improvements are presented as a change on top of mine, rather than as an alternative. I'll push my du changes shortly, and then post a proposed new version of your patch. From 9f69331886353e5794bb7dc9c0a31c3cd99d33f7 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Sat, 3 Jul 2010 14:15:03 +0200 Subject: [PATCH] touch-up --- gl/lib/di-set.c | 6 +++--- gl/lib/ino-map.c | 11 +++++++---- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index 76baf54..b6dce45 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -71,7 +71,7 @@ struct di_set struct di_ent *probe; }; -/* Hash a directory set entry. */ +/* Hash a device-inode-set entry. */ static size_t di_ent_hash (void const *x, size_t table_size) { @@ -88,7 +88,7 @@ di_ent_hash (void const *x, size_t table_size) } } -/* Return true if two directory set entries are the same. */ +/* Return true if two device-inode-set entries are the same. */ static bool di_ent_compare (void const *x, void const *y) { @@ -97,7 +97,7 @@ di_ent_compare (void const *x, void const *y) return a->dev == b->dev; } -/* Free a directory set entry. */ +/* Free a device-inode-set entry. */ static void di_ent_free (void *v) { diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c index 083b49d..1123d52 100644 --- a/gl/lib/ino-map.c +++ b/gl/lib/ino-map.c @@ -138,11 +138,16 @@ ino_map_insert (struct ino_map *im, ino_t ino) if (! ent) return INO_MAP_INSERT_FAILURE; - if (ent == probe) + if (ent != probe) + { + /* There was already an existing entry. Use it. */ + probe->mapped_ino = ent->mapped_ino; + } + else { /* If adding 1 to map->next_mapped_ino would cause it to overflow to zero, then it must equal INO_MAP_INSERT_FAILURE, - which is value that should be returned in that case. Verify + which is the value that should be returned in that case. Verify that this works. */ verify (INO_MAP_INSERT_FAILURE + 1 == 0); @@ -152,8 +157,6 @@ ino_map_insert (struct ino_map *im, ino_t ino) /* INO is new; allocate a mapped inode number for it. */ probe->mapped_ino = im->next_mapped_ino++; } - else - probe->mapped_ino = ent->mapped_ino; return probe->mapped_ino; } -- 1.7.2.rc1.192.g262ff
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 04 Jul 2010 09:32:02 GMT) Full text and rfc822 format available.owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 04 Jul 2010 10:01:02 GMT) Full text and rfc822 format available.Message #56 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: 6524 <at> debbugs.gnu.org Cc: Paul Eggert <eggert <at> cs.ucla.edu> Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Sun, 04 Jul 2010 12:00:33 +0200
Jim Meyering wrote: > This is just a preliminary heads-up. At least one detail must be > addressed before I push: I haven't written justification for the > change to hash.c (I confess that I don't remember -- parts of this > change date back to Dec 2008 ;-) I've pushed the four change-sets I wrote. Here's Paul's improvement, including the touch-up changes I mentioned. Paul, one question I have is whether to include an initial_inode_set_size parameter in your di_set_alloc function. That would make it more efficient in the event that an application happens to know in advance the number of inodes it will process (say if it's processing a single device and all of its inodes). However, there's no rush to address that -- you're welcome to push this as-is. From 08ac2f42353a35451696f9a390da1e5e6c43adaa Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert <at> cs.ucla.edu> Date: Sun, 4 Jul 2010 09:06:59 +0200 Subject: [PATCH] du: improve device-inode tracking code The preceding implementation would not save space when most inode numbers have one of the top few bits set. Remove that limitation by using a separate inode-managing set for each distinct device. This patch continues to treat inode 0 specially, to work around a problem with hash.c. Recently hash.c was changed to no longer reject NULL keys, but its internals still depend on not allowing NULL keys so this patch is careful to never create a key by casting 0 to (void *). The assumption is that 0 is the only size_t value i such that ((void *) i == NULL); this assumption is true for all practical architectures that I know about, though it is not guaranteed by any standard. The preceding implementation also ran afoul of the i686-specific gcc bug, c/44806 (fixed in gcc-4.5.1), so replacing its bitfield- and union-using code with this simpler design is doubly welcome. * gl/lib/ino-map.c, gl/lib/ino-map.h: New files. * gl/modules/ino-map, gl/modules/ino-map-tests: New files. * gl/modules/di-set, gl/modules/di-set-tests: New files. * gl/tests/test-di-set.c, gl/tests/test-ino-map.c: New files. * gl/lib/dev-map.c, gl/lib/dev-map.h: Remove files. * gl/modules/dev-map, gl/modules/dev-map-tests: Remove files. * gl/tests/test-dev-map.c: Remove files. * src/du.c (INITIAL_DI_SET_SIZE): Remove definition; no longer needed, since di_set_alloc takes no size parameter. (di_set): Use a global pointer rather than a struct. Update all uses. (main): Hoist a di_set_free call. --- gl/lib/dev-map.c | 110 --------- gl/lib/dev-map.h | 23 -- gl/lib/di-set.c | 350 ++++++++++++--------------- gl/lib/di-set.h | 28 +-- gl/lib/ino-map.c | 162 ++++++++++++ gl/lib/ino-map.h | 14 + gl/modules/dev-map | 23 -- gl/modules/dev-map-tests | 10 - gl/modules/di-set | 3 +- gl/modules/ino-map | 24 ++ gl/modules/ino-map-tests | 10 + gl/tests/test-di-set.c | 31 +-- gl/tests/{test-dev-map.c => test-ino-map.c} | 34 ++-- src/du.c | 21 +- 14 files changed, 399 insertions(+), 444 deletions(-) delete mode 100644 gl/lib/dev-map.c delete mode 100644 gl/lib/dev-map.h create mode 100644 gl/lib/ino-map.c create mode 100644 gl/lib/ino-map.h delete mode 100644 gl/modules/dev-map delete mode 100644 gl/modules/dev-map-tests create mode 100644 gl/modules/ino-map create mode 100644 gl/modules/ino-map-tests rename gl/tests/{test-dev-map.c => test-ino-map.c} (72%) diff --git a/gl/lib/dev-map.c b/gl/lib/dev-map.c deleted file mode 100644 index 363f5af..0000000 --- a/gl/lib/dev-map.c +++ /dev/null @@ -1,110 +0,0 @@ -/* Map a dev_t device number to a small integer. - - Copyright 2009-2010 Free Software Foundation, Inc. - - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 3 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program. If not, see <http://www.gnu.org/licenses/>. */ - -/* written by Jim Meyering */ - -#include <config.h> -#include "dev-map.h" - -#include <stdio.h> -#include <assert.h> -#include <stdint.h> -#include <stdlib.h> - -struct dev_map_ent -{ - dev_t dev; - uint32_t mapped_dev; -}; - -enum { INITIAL_DEV_MAP_TABLE_SIZE = 31 }; - -static size_t -dev_hash (void const *x, size_t table_size) -{ - struct dev_map_ent const *p = x; - return (uintmax_t) p->dev % table_size; -} - -/* Compare two dev_map_ent structs on "dev". - Return true if they are the same. */ -static bool -dev_compare (void const *x, void const *y) -{ - struct dev_map_ent const *a = x; - struct dev_map_ent const *b = y; - return a->dev == b->dev ? true : false; -} - -/* Initialize state. */ -int -dev_map_init (struct dev_map *dev_map) -{ - assert (dev_map != NULL); - dev_map->n_device = 0; - dev_map->dev_map = hash_initialize (INITIAL_DEV_MAP_TABLE_SIZE, NULL, - dev_hash, dev_compare, free); - return dev_map->dev_map ? 0 : -1; -} - -void -dev_map_free (struct dev_map *dev_map) -{ - hash_free (dev_map->dev_map); -} - -/* Insert device number, DEV, into the map, DEV_MAP if it's not there already, - and return the nonnegative, mapped and usually smaller, number. - If DEV is already in DEV_MAP, return the existing mapped value. - Upon memory allocation failure, return -1. */ -int -dev_map_insert (struct dev_map *dev_map, dev_t dev) -{ - /* Attempt to insert <DEV,n_device> in the map. - Possible outcomes: - - DEV was already there, with a different necessarily smaller value - - DEV was not there, (thus was just inserted) - - insert failed: out of memory - Return -1 on out of memory. - */ - - struct dev_map_ent *ent_from_table; - struct dev_map_ent *ent = malloc (sizeof *ent); - if (!ent) - return -1; - - ent->dev = dev; - ent->mapped_dev = dev_map->n_device; - - ent_from_table = hash_insert (dev_map->dev_map, ent); - if (ent_from_table == NULL) - { - /* Insertion failed due to lack of memory. */ - return -1; - } - - if (ent_from_table == ent) - { - /* Insertion succeeded: new device. */ - return dev_map->n_device++; - } - - /* That DEV value is already in the table, so ENT was not inserted. - Free it and return the already-associated value. */ - free (ent); - return ent_from_table->mapped_dev; -} diff --git a/gl/lib/dev-map.h b/gl/lib/dev-map.h deleted file mode 100644 index f093d90..0000000 --- a/gl/lib/dev-map.h +++ /dev/null @@ -1,23 +0,0 @@ -#include <stddef.h> -#include <sys/types.h> -#include <sys/stat.h> -#include "hash.h" - -struct dev_map -{ - /* KEY,VAL pair, where KEY is the raw st_dev value - and VAL is the small number that maps to. */ - struct hash_table *dev_map; - size_t n_device; -}; - -#undef _ATTRIBUTE_NONNULL_ -#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ -# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) -#else -# define _ATTRIBUTE_NONNULL_(m) -#endif - -int dev_map_init (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); -void dev_map_free (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); -int dev_map_insert (struct dev_map *, dev_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index 3c4717b..b6dce45 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -15,262 +15,218 @@ You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. */ -/* written by Jim Meyering */ +/* written by Paul Eggert and Jim Meyering */ #include <config.h> #include "di-set.h" -#include <stdio.h> -#include <assert.h> +#include "hash.h" +#include "ino-map.h" + #include <stdint.h> #include <stdlib.h> -#include <sys/types.h> -#include <sys/stat.h> - -#include "verify.h" - -/* Set operations for device-inode pairs stored in a space-efficient manner. - A naive mapping uses 16 bytes to save a single st_dev, st_ino pair. - However, in many applications, the vast majority of actual device,inode - number pairs can be efficiently compressed to fit in 8 or even 4 bytes, - by using a separate table to map a relatively small number of devices - to small integers. */ - -#define N_DEV_BITS_4 5 -#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) - -#define N_DEV_BITS_8 8 -#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) - -/* Note how the last bit is always set. - This is required, in order to be able to distinguish - an encoded di_ent value from a malloc-returned pointer, - which must be 4-byte-aligned or better. */ -struct dev_ino_4 -{ - uint32_t mode:2; /* must be first */ - uint32_t short_ino:N_INO_BITS_4; - uint32_t mapped_dev:N_DEV_BITS_4; - uint32_t always_set:1; -}; -verify (N_DEV_BITS_4 <= 8 * sizeof (int)); -verify (sizeof (struct dev_ino_4) == 4); - -struct dev_ino_8 -{ - uint32_t mode:2; /* must be first */ - uint64_t short_ino:N_INO_BITS_8; - uint32_t mapped_dev:N_DEV_BITS_8; - uint32_t always_set:1; -}; -verify (sizeof (struct dev_ino_8) == 8); -struct dev_ino_full -{ - uint32_t mode:2; /* must be first */ - dev_t dev; - ino_t ino; -}; - -enum di_mode -{ - DI_MODE_4 = 1, - DI_MODE_8 = 2, - DI_MODE_FULL = 3 -}; - -/* - di_mode raw_inode mapped dev always_set - \____________|_______________\_____/ - 4-byte | 2| 25 | 5 |1| mapped_dev - `----------------------------------------------------|-----. - 8-byte | 2| 53 | 8 |1| - `----------------------------------------------------------' -*/ +/* The hash package hashes "void *", but this package wants to hash + integers. Use integers that are as large as possible, but no + larger than void *, so that they can be cast to void * and back + without losing information. */ +typedef size_t hashint; +#define HASHINT_MAX ((hashint) -1) + +/* Integers represent inode numbers. Integers in the range + 1..(LARGE_INO_MIN-1) represent inode numbers directly. (The hash + package does not work with null pointers, so inode 0 cannot be used + as a key.) To find the representations of other inode numbers, map + them through INO_MAP. */ +#define LARGE_INO_MIN (HASHINT_MAX / 2) + +/* Set operations for device-inode pairs stored in a space-efficient + manner. Use a two-level hash table. The top level hashes by + device number, as there are typically a small number of devices. + The lower level hashes by mapped inode numbers. In the typical + case where the inode number is positive and small, the inode number + maps to itself, masquerading as a void * value; otherwise, its + value is the result of hashing the inode value through INO_MAP. */ + +/* A pair that maps a device number to a set of inode numbers. */ struct di_ent { - union - { - struct dev_ino_4 di4; - struct dev_ino_8 di8; - struct dev_ino_full full; - uint32_t u32; - uint64_t u64; - void *ptr; - } u; -}; - -struct dev_map_ent -{ dev_t dev; - uint32_t mapped_dev; + struct hash_table *ino_set; }; -static inline bool -is_encoded_ptr (struct di_ent const *v) +/* A two-level hash table that manages and indexes these pairs. */ +struct di_set { - return (size_t) v % 4; -} + /* Map device numbers to sets of inode number representatives. */ + struct hash_table *dev_map; -static struct di_ent -decode_ptr (struct di_ent const *v) -{ - if (!is_encoded_ptr (v)) - return *v; + /* If nonnull, map large inode numbers to their small + representatives. If null, there are no large inode numbers in + this set. */ + struct ino_map *ino_map; - struct di_ent di; - di.u.ptr = (void *) v; - return di; -} + /* Cache of the most recently allocated and otherwise-unused storage + for probing this table. */ + struct di_ent *probe; +}; +/* Hash a device-inode-set entry. */ static size_t di_ent_hash (void const *x, size_t table_size) { - struct di_ent e = decode_ptr (x); - return (e.u.di4.mode == DI_MODE_4 - ? e.u.u32 - : (e.u.di4.mode == DI_MODE_8 - ? e.u.u64 - : e.u.full.ino)) % table_size; + struct di_ent const *p = x; + + /* Use ordinary % if it's known to return a nonnegative value. + Otherwise, fall back on uintmax_t %, which could be much slower. */ + if ((dev_t) -1 % (size_t) 2 == 1) + return p->dev % table_size; + else + { + uintmax_t dev = p->dev; + return dev % table_size; + } } -/* Compare two di_ent structs. - Return true if they are the same. */ +/* Return true if two device-inode-set entries are the same. */ static bool di_ent_compare (void const *x, void const *y) { - struct di_ent a = decode_ptr (x); - struct di_ent b = decode_ptr (y); - if (a.u.di4.mode != b.u.di4.mode) - return false; - - if (a.u.di4.mode == DI_MODE_4) - return (a.u.di4.short_ino == b.u.di4.short_ino - && a.u.di4.mapped_dev == b.u.di4.mapped_dev); - - if (a.u.di8.mode == DI_MODE_8) - return (a.u.di8.short_ino == b.u.di8.short_ino - && a.u.di8.mapped_dev == b.u.di8.mapped_dev); - - return (a.u.full.ino == b.u.full.ino - && a.u.full.dev == b.u.full.dev); + struct di_ent const *a = x; + struct di_ent const *b = y; + return a->dev == b->dev; } +/* Free a device-inode-set entry. */ static void di_ent_free (void *v) { - if ( ! is_encoded_ptr (v)) - free (v); + struct di_ent *a = v; + hash_free (a->ino_set); + free (a); } -int -di_set_init (struct di_set_state *dis, size_t initial_size) +/* Create a set of device-inode pairs. Return NULL on allocation failure. */ +struct di_set * +di_set_alloc (void) { - if (dev_map_init (&dis->dev_map) < 0) - return -1; + struct di_set *dis = malloc (sizeof *dis); + if (dis) + { + enum { INITIAL_DEV_MAP_SIZE = 11 }; + dis->dev_map = hash_initialize (INITIAL_DEV_MAP_SIZE, NULL, + di_ent_hash, di_ent_compare, + di_ent_free); + if (! dis->dev_map) + { + free (dis); + return NULL; + } + dis->ino_map = NULL; + dis->probe = NULL; + } - dis->di_set = hash_initialize (initial_size, NULL, - di_ent_hash, di_ent_compare, di_ent_free); - return dis->di_set ? 0 : -1; + return dis; } +/* Free a set of device-inode pairs. */ void -di_set_free (struct di_set_state *dis) +di_set_free (struct di_set *dis) { - dev_map_free (&dis->dev_map); - hash_free (dis->di_set); + hash_free (dis->dev_map); + free (dis->ino_map); + free (dis->probe); + free (dis); } -/* Given a device-inode set, DIS, create an entry for the DEV,INO - pair, and store it in *V. If possible, encode DEV,INO into the pointer - itself, but if not, allocate space for a full "struct di_ent" and set *V - to that pointer. Upon memory allocation failure, return -1. - Otherwise return 0. */ -int -di_ent_create (struct di_set_state *dis, - dev_t dev, ino_t ino, - struct di_ent **v) +/* Hash an encoded inode number I. */ +static size_t +di_ino_hash (void const *i, size_t table_size) { - static int prev_m = -1; - static dev_t prev_dev = -1; - struct di_ent di_ent; - int mapped_dev; + return (hashint) i % table_size; +} - if (dev == prev_dev) - mapped_dev = prev_m; - else +/* Using the DIS table, map a device to a hash table that represents + a set of inode numbers. Return NULL on error. */ +static struct hash_table * +map_device (struct di_set *dis, dev_t dev) +{ + /* Find space for the probe, reusing the cache if available. */ + struct di_ent *ent; + struct di_ent *probe = dis->probe; + if (probe) { - mapped_dev = dev_map_insert (&dis->dev_map, dev); - if (mapped_dev < 0) - return -1; - prev_dev = dev; - prev_m = mapped_dev; + /* If repeating a recent query, return the cached result. */ + if (probe->dev == dev) + return probe->ino_set; } - - if (mapped_dev < (1 << N_DEV_BITS_4) - && ino < (1 << N_INO_BITS_4)) + else { -#if lint - /* When this struct is smaller than a pointer, initialize - the pointer so tools like valgrind don't complain about - the uninitialized bytes. */ - if (sizeof di_ent.u.di4 < sizeof di_ent.u.ptr) - di_ent.u.ptr = NULL; -#endif - di_ent.u.di4.mode = DI_MODE_4; - di_ent.u.di4.short_ino = ino; - di_ent.u.di4.mapped_dev = mapped_dev; - di_ent.u.di4.always_set = 1; - *v = di_ent.u.ptr; + dis->probe = probe = malloc (sizeof *probe); + if (! probe) + return NULL; } - else if (mapped_dev < (1 << N_DEV_BITS_8) - && ino < ((uint64_t) 1 << N_INO_BITS_8)) + + /* Probe for the device. */ + probe->dev = dev; + ent = hash_insert (dis->dev_map, probe); + if (! ent) + return NULL; + + if (ent == probe) { - di_ent.u.di8.mode = DI_MODE_8; - di_ent.u.di8.short_ino = ino; - di_ent.u.di8.mapped_dev = mapped_dev; - di_ent.u.di8.always_set = 1; - *v = di_ent.u.ptr; + enum { INITIAL_INO_SET_SIZE = 1021 }; + + /* Prepare to allocate a new probe next time; this one is in use. */ + dis->probe = NULL; + + /* DEV is new; allocate an inode set for it. */ + probe->ino_set = hash_initialize (INITIAL_INO_SET_SIZE, NULL, + di_ino_hash, NULL, NULL); } else + probe->ino_set = ent->ino_set; + + return probe->ino_set; +} + +/* Using the DIS table, map an inode number to a mapped value. + Return INO_MAP_INSERT_FAILURE on error. */ +static hashint +map_inode_number (struct di_set *dis, ino_t ino) +{ + if (0 < ino && ino < LARGE_INO_MIN) + return ino; + + if (! dis->ino_map) { - /* Handle the case in which INO is too large or in which (far less - likely) we encounter hard-linked files on 2^N_DEV_BITS_8 - different devices. */ - struct di_ent *p = malloc (sizeof *p); - if (!p) - return -1; - assert ((size_t) p % 4 == 0); - p->u.full.mode = DI_MODE_FULL; - p->u.full.ino = ino; - p->u.full.dev = dev; - *v = p; + dis->ino_map = ino_map_alloc (LARGE_INO_MIN); + if (! dis->ino_map) + return INO_MAP_INSERT_FAILURE; } - return 0; + return ino_map_insert (dis->ino_map, ino); } -/* Attempt to insert the DEV,INO pair into the set, DIS. - If it matches a pair already in DIS, don't modify DIS and return 0. +/* Attempt to insert the DEV,INO pair into the set DIS. + If it matches a pair already in DIS, keep that pair and return 0. Otherwise, if insertion is successful, return 1. Upon any failure return -1. */ int -di_set_insert (struct di_set_state *dis, dev_t dev, ino_t ino) +di_set_insert (struct di_set *dis, dev_t dev, ino_t ino) { - struct di_ent *v; - if (di_ent_create (dis, dev, ino, &v) < 0) - return -1; + hashint i; - int err = hash_insert0 (dis->di_set, v, NULL); - if (err == -1) /* Insertion failed due to lack of memory. */ + /* Map the device number to a set of inodes. */ + struct hash_table *ino_set = map_device (dis, dev); + if (! ino_set) return -1; - if (err == 1) /* Insertion succeeded. */ - return 1; - - /* That pair is already in the table, so ENT was not inserted. Free it. */ - if (! is_encoded_ptr (v)) - free (v); + /* Map the inode number to a small representative I. */ + i = map_inode_number (dis, ino); + if (i == INO_MAP_INSERT_FAILURE) + return -1; - return 0; + /* Put I into the inode set. */ + return hash_insert0 (ino_set, (void *) i, NULL); } diff --git a/gl/lib/di-set.h b/gl/lib/di-set.h index f90d0dd..d30a31e 100644 --- a/gl/lib/di-set.h +++ b/gl/lib/di-set.h @@ -1,28 +1,12 @@ -#include "dev-map.h" - -struct di_set_state -{ - /* A map to help compact device numbers. */ - struct dev_map dev_map; - - /* A set of compact dev,inode pairs. */ - struct hash_table *di_set; -}; +#include <sys/types.h> #undef _ATTRIBUTE_NONNULL_ #if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ -# define _ATTRIBUTE_NONNULL_(m,...) __attribute__ ((__nonnull__ (m))) +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) #else -# define _ATTRIBUTE_NONNULL_(m,...) +# define _ATTRIBUTE_NONNULL_(m) #endif -int di_set_init (struct di_set_state *, size_t) _ATTRIBUTE_NONNULL_ (1); -void di_set_free (struct di_set_state *) _ATTRIBUTE_NONNULL_ (1); -int di_set_insert (struct di_set_state *, dev_t, ino_t) - _ATTRIBUTE_NONNULL_ (1); - -struct di_ent; -int di_ent_create (struct di_set_state *di_set_state, - dev_t dev, ino_t ino, - struct di_ent **di_ent) - _ATTRIBUTE_NONNULL_ (1,4); +struct di_set *di_set_alloc (void); +int di_set_insert (struct di_set *, dev_t, ino_t) _ATTRIBUTE_NONNULL_ (1); +void di_set_free (struct di_set *) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c new file mode 100644 index 0000000..1123d52 --- /dev/null +++ b/gl/lib/ino-map.c @@ -0,0 +1,162 @@ +/* Map an ino_t inode number to a small integer. + + Copyright 2009, 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Paul Eggert and Jim Meyering */ + +#include <config.h> +#include "ino-map.h" + +#include "hash.h" +#include "verify.h" + +#include <stdint.h> +#include <stdlib.h> + +/* A pair that maps an inode number to a mapped inode number; the + latter is a small unique ID for the former. */ +struct ino_map_ent +{ + ino_t ino; + size_t mapped_ino; +}; + +/* A table that manages and indexes these pairs. */ +struct ino_map +{ + /* A table of KEY,VAL pairs, where KEY is the raw ino_t value and + VAL is the small number that it maps to. */ + struct hash_table *map; + + /* The next mapped inode number to hand out. */ + size_t next_mapped_ino; + + /* Cache of the most recently allocated and otherwise-unused storage + for probing the table. */ + struct ino_map_ent *probe; +}; + +/* Hash an inode map entry. */ +static size_t +ino_hash (void const *x, size_t table_size) +{ + struct ino_map_ent const *p = x; + + /* Use ordinary % if it's known to return a nonnegative value. + Otherwise, fall back on uintmax_t %, which could be much slower. */ + if ((ino_t) -1 % (size_t) 2 == 1) + return p->ino % table_size; + else + { + uintmax_t ino = p->ino; + return ino % table_size; + } +} + +/* Return true if two inode map entries are the same. */ +static bool +ino_compare (void const *x, void const *y) +{ + struct ino_map_ent const *a = x; + struct ino_map_ent const *b = y; + return a->ino == b->ino; +} + +/* Allocate an inode map that will hand out integers starting with + NEXT_MAPPED_INO. Return NULL if memory is exhausted. */ +struct ino_map * +ino_map_alloc (size_t next_mapped_ino) +{ + struct ino_map *im = malloc (sizeof *im); + + if (im) + { + enum { INITIAL_INO_MAP_TABLE_SIZE = 1021 }; + im->map = hash_initialize (INITIAL_INO_MAP_TABLE_SIZE, NULL, + ino_hash, ino_compare, free); + if (! im->map) + { + free (im); + return NULL; + } + im->next_mapped_ino = next_mapped_ino; + im->probe = NULL; + } + + return im; +} + +/* Free an inode map. */ +void +ino_map_free (struct ino_map *map) +{ + hash_free (map->map); + free (map->probe); + free (map); +} + + +/* Insert into MAP the inode number INO if it's not there already, + and return its nonnegative mapped inode number. + If INO is already in MAP, return the existing mapped inode number. + Return INO_MAP_INSERT_FAILURE on memory or counter exhaustion. */ +size_t +ino_map_insert (struct ino_map *im, ino_t ino) +{ + struct ino_map_ent *ent; + + /* Find space for the probe, reusing the cache if available. */ + struct ino_map_ent *probe = im->probe; + if (probe) + { + /* If repeating a recent query, return the cached result. */ + if (probe->ino == ino) + return probe->mapped_ino; + } + else + { + im->probe = probe = malloc (sizeof *probe); + if (! probe) + return INO_MAP_INSERT_FAILURE; + } + + probe->ino = ino; + ent = hash_insert (im->map, probe); + if (! ent) + return INO_MAP_INSERT_FAILURE; + + if (ent != probe) + { + /* There was already an existing entry. Use it. */ + probe->mapped_ino = ent->mapped_ino; + } + else + { + /* If adding 1 to map->next_mapped_ino would cause it to + overflow to zero, then it must equal INO_MAP_INSERT_FAILURE, + which is the value that should be returned in that case. Verify + that this works. */ + verify (INO_MAP_INSERT_FAILURE + 1 == 0); + + /* Prepare to allocate a new probe next time; this one is in use. */ + im->probe = NULL; + + /* INO is new; allocate a mapped inode number for it. */ + probe->mapped_ino = im->next_mapped_ino++; + } + + return probe->mapped_ino; +} diff --git a/gl/lib/ino-map.h b/gl/lib/ino-map.h new file mode 100644 index 0000000..661b78a --- /dev/null +++ b/gl/lib/ino-map.h @@ -0,0 +1,14 @@ +#include <sys/types.h> + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m) +#endif + +#define INO_MAP_INSERT_FAILURE ((size_t) -1) + +struct ino_map *ino_map_alloc (size_t); +void ino_map_free (struct ino_map *) _ATTRIBUTE_NONNULL_ (1); +size_t ino_map_insert (struct ino_map *, ino_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/modules/dev-map b/gl/modules/dev-map deleted file mode 100644 index 91f437b..0000000 --- a/gl/modules/dev-map +++ /dev/null @@ -1,23 +0,0 @@ -Description: -maintain a mapping of dev_t numbers to small integers - -Files: -lib/dev-map.c -lib/dev-map.h - -Depends-on: -hash - -configure.ac: - -Makefile.am: -lib_SOURCES += dev-map.c dev-map.h - -Include: -"dev-map.h" - -License -GPL - -Maintainer: -Jim Meyering diff --git a/gl/modules/dev-map-tests b/gl/modules/dev-map-tests deleted file mode 100644 index 4bec2e6..0000000 --- a/gl/modules/dev-map-tests +++ /dev/null @@ -1,10 +0,0 @@ -Files: -tests/test-dev-map.c - -Depends-on: - -configure.ac: - -Makefile.am: -TESTS += test-dev-map -check_PROGRAMS += test-dev-map diff --git a/gl/modules/di-set b/gl/modules/di-set index fe52778..562db14 100644 --- a/gl/modules/di-set +++ b/gl/modules/di-set @@ -6,9 +6,8 @@ lib/di-set.c lib/di-set.h Depends-on: -dev-map +ino-map hash -verify configure.ac: diff --git a/gl/modules/ino-map b/gl/modules/ino-map new file mode 100644 index 0000000..06ee51d --- /dev/null +++ b/gl/modules/ino-map @@ -0,0 +1,24 @@ +Description: +maintain a mapping of ino_t numbers to small integers + +Files: +lib/ino-map.c +lib/ino-map.h + +Depends-on: +hash +verify + +configure.ac: + +Makefile.am: +lib_SOURCES += ino-map.c ino-map.h + +Include: +"ino-map.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/ino-map-tests b/gl/modules/ino-map-tests new file mode 100644 index 0000000..fa10b4f --- /dev/null +++ b/gl/modules/ino-map-tests @@ -0,0 +1,10 @@ +Files: +tests/test-ino-map.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-ino-map +check_PROGRAMS += test-ino-map diff --git a/gl/tests/test-di-set.c b/gl/tests/test-di-set.c index 4d5823e..e5fb6cb 100644 --- a/gl/tests/test-di-set.c +++ b/gl/tests/test-di-set.c @@ -1,4 +1,4 @@ -/* Test the dev-map module. +/* Test the di-set module. Copyright (C) 2010 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify @@ -36,41 +36,21 @@ #include "di-set.h" -/* FIXME: ugly duplication of code from di-set.c */ -#define N_DEV_BITS_4 5 -#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) - -#define N_DEV_BITS_8 8 -#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) - int main (void) { /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ - size_t initial_size = 61; - /* "real" code might prefer to avoid the allocation here, simply - declaring "struct di_set_state dis;", do a global substitution, - s/\<dis\>/\&dis/, and remove the final free. */ - struct di_set_state *dis = malloc (sizeof *dis); + struct di_set *dis = di_set_alloc (); ASSERT (dis); - ASSERT (di_set_init (dis, initial_size) == 0); - - struct di_ent *di_ent; - ASSERT (di_ent_create (dis, 1, 1, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1 << N_DEV_BITS_4, 1, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1, 1 << N_INO_BITS_4, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1, - (uint64_t) 1 << N_INO_BITS_8, &di_ent) == 0); - free (di_ent); ASSERT (di_set_insert (dis, 2, 5) == 1); /* first insertion succeeds */ ASSERT (di_set_insert (dis, 2, 5) == 0); /* duplicate fails */ ASSERT (di_set_insert (dis, 3, 5) == 1); /* diff dev, duplicate inode is ok */ ASSERT (di_set_insert (dis, 2, 8) == 1); /* same dev, different inode is ok */ - /* very large inode number */ - ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 1); - ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 0); /* dup */ + /* very large (or negative) inode number */ + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 1); + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 0); /* dup */ unsigned int i; for (i = 0; i < 3000; i++) @@ -79,7 +59,6 @@ main (void) ASSERT (di_set_insert (dis, 9, i) == 0); /* duplicate fails */ di_set_free (dis); - free (dis); return 0; } diff --git a/gl/tests/test-dev-map.c b/gl/tests/test-ino-map.c similarity index 72% rename from gl/tests/test-dev-map.c rename to gl/tests/test-ino-map.c index 98ba630..2b44602 100644 --- a/gl/tests/test-dev-map.c +++ b/gl/tests/test-ino-map.c @@ -1,4 +1,4 @@ -/* Test the dev-map module. +/* Test the ino-map module. Copyright (C) 2010 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify @@ -34,30 +34,30 @@ } \ while (0) -#include "dev-map.h" - -/* Risky: this is also defined in di-set.c; they should match. */ -#define N_DEV_BITS_4 5 +#include "ino-map.h" int main () { /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ - struct dev_map dev_map; - ASSERT (dev_map_init (&dev_map) == 0); - - ASSERT (dev_map_insert (&dev_map, 42) == 0); - ASSERT (dev_map_insert (&dev_map, 42) == 0); - ASSERT (dev_map_insert (&dev_map, 398) == 1); - ASSERT (dev_map_insert (&dev_map, 398) == 1); - - int32_t i; - for (i = 0; i < (1 << N_DEV_BITS_4); i++) + enum { INO_MAP_INIT = 123 }; + struct ino_map *ino_map = ino_map_alloc (INO_MAP_INIT); + ASSERT (ino_map != NULL); + + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + + int i; + for (i = 0; i < 100; i++) { - ASSERT (dev_map_insert (&dev_map, 10000+i) == 2+i); + ASSERT (ino_map_insert (ino_map, 10000 + i) == INO_MAP_INIT + 3 + i); } - dev_map_free (&dev_map); + ino_map_free (ino_map); return 0; } diff --git a/src/du.c b/src/du.c index 7c02c86..739be73 100644 --- a/src/du.c +++ b/src/du.c @@ -60,11 +60,8 @@ extern bool fts_debug; # define FTS_CROSS_CHECK(Fts) #endif -/* Initial size of the hash table. */ -enum { INITIAL_DI_SET_SIZE = 1021 }; - /* A set of dev/ino pairs. */ -static struct di_set_state di_set; +static struct di_set *di_set; /* Define a class for collecting directory information. */ @@ -337,14 +334,10 @@ Mandatory arguments to long options are mandatory for short options too.\n\ static bool hash_ins (ino_t ino, dev_t dev) { - int inserted = di_set_insert (&di_set, dev, ino); + int inserted = di_set_insert (di_set, dev, ino); if (inserted < 0) - { - /* Insertion failed due to lack of memory. */ - xalloc_die (); - } - - return inserted ? true : false; + xalloc_die (); + return inserted; } /* FIXME: this code is nearly identical to code in date.c */ @@ -905,7 +898,8 @@ main (int argc, char **argv) xalloc_die (); /* Initialize the set of dev,inode pairs. */ - if (di_set_init (&di_set, INITIAL_DI_SET_SIZE)) + di_set = di_set_alloc (); + if (!di_set) xalloc_die (); bit_flags |= symlink_deref_bits; @@ -977,6 +971,7 @@ main (int argc, char **argv) } argv_iter_free (ai); + di_set_free (di_set); if (files_from && (ferror (stdin) || fclose (stdin) != 0)) error (EXIT_FAILURE, 0, _("error reading %s"), quote (files_from)); @@ -984,7 +979,5 @@ main (int argc, char **argv) if (print_grand_total) print_size (&tot_dui, _("total")); - di_set_free (&di_set); - exit (ok ? EXIT_SUCCESS : EXIT_FAILURE); } -- 1.7.2.rc1.192.g262ff
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 04 Jul 2010 14:16:02 GMT) Full text and rfc822 format available.Message #59 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: 6524 <at> debbugs.gnu.org Cc: Paul Eggert <eggert <at> cs.ucla.edu> Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Sun, 04 Jul 2010 16:14:51 +0200
Jim Meyering wrote: > Jim Meyering wrote: >> This is just a preliminary heads-up. At least one detail must be >> addressed before I push: I haven't written justification for the >> change to hash.c (I confess that I don't remember -- parts of this >> change date back to Dec 2008 ;-) > > I've pushed the four change-sets I wrote. > Here's Paul's improvement, including the touch-up changes I mentioned. > > Paul, one question I have is whether to include an initial_inode_set_size > parameter in your di_set_alloc function. That would make it more > efficient in the event that an application happens to know in advance the > number of inodes it will process (say if it's processing a single device > and all of its inodes). However, there's no rush to address that -- > you're welcome to push this as-is. > > Subject: [PATCH] du: improve device-inode tracking code > > The preceding implementation would not save space when most inode > numbers have one of the top few bits set. Remove that limitation > by using a separate inode-managing set for each distinct device. BTW, if someone can test this patch on a 32-bit system on a tree with lots of hard-linked files that have inode numbers of 2^32 or larger, it would be instructive to see how it compares to the version of du before any of these changes. I think it will still save memory, but how does performance change in that case?
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Sun, 04 Jul 2010 22:45:04 GMT) Full text and rfc822 format available.Message #62 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Pádraig Brady <P <at> draigBrady.com> To: Jim Meyering <jim <at> meyering.net> Cc: bug-gnulib <at> gnu.org, Paul Eggert <eggert <at> CS.UCLA.EDU>, bug-coreutils <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Sun, 04 Jul 2010 23:43:33 +0100
On 02/07/10 20:17, Jim Meyering wrote: > Thanks again for the report. > However, while I was able to reproduce it (on Paul's system) > and debug it, it appears to be due to a miscompilation of di-set.o > when using a private copy of gcc-4.5.0. When I recompiled > that one file with gcc-Ubuntu 4.4.3-4ubuntu5 and -g -O2 > or with 4.5.0 and -g -O, the code works as expected. I used a private gcc 4.5.0 also, and switched to gcc 4.4.1 for debugging where I couldn't reproduce it (gcc 4.5.0 produces debugging info with incompatible line number info for my gdb (6.8.50)). I had used gcc 4.5.0 to test -Wlogical-op with coreutils which was also buggy unfortunately: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43772 I'll not be using that again. cheers, Pádraig.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 05 Jul 2010 05:30:03 GMT) Full text and rfc822 format available.Message #65 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Pádraig Brady <P <at> draigBrady.com> Cc: bug-coreutils <at> gnu.org, Paul Eggert <eggert <at> CS.UCLA.EDU>, bug-gnulib <at> gnu.org, 6524 <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Mon, 05 Jul 2010 07:29:26 +0200
Pádraig Brady wrote: > On 02/07/10 20:17, Jim Meyering wrote: >> Thanks again for the report. >> However, while I was able to reproduce it (on Paul's system) >> and debug it, it appears to be due to a miscompilation of di-set.o >> when using a private copy of gcc-4.5.0. When I recompiled >> that one file with gcc-Ubuntu 4.4.3-4ubuntu5 and -g -O2 >> or with 4.5.0 and -g -O, the code works as expected. > > I used a private gcc 4.5.0 also, and switched to > gcc 4.4.1 for debugging where I couldn't reproduce it > (gcc 4.5.0 produces debugging info with incompatible > line number info for my gdb (6.8.50)). I reported the code gen bug (i686-specific), and learned that it's fixed on the trunk and for the upcoming 4.5.1: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44806
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Tue, 06 Jul 2010 21:16:01 GMT) Full text and rfc822 format available.Message #68 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Tue, 06 Jul 2010 14:15:09 -0700
On 07/04/10 07:14, Jim Meyering wrote: > BTW, if someone can test this patch on a 32-bit system > on a tree with lots of hard-linked files that have inode numbers > of 2^32 or larger, it would be instructive to see how it compares > to the version of du before any of these changes. I think it will > still save memory, but how does performance change in that case? I looked into this and found a performance hit due to the 64-bit % in the code I proposed. I hadn't thought the % would be a CPU bottleneck! Anyway, I fixed this by XORing the two halves of the 64-bit dividend before doing a 32-bit %, and this caused the resulting code to run about as fast as coreutils 8.5 (any difference was below the resolution of the time stamp on my "du cu*" benchmark; I artificially modified my benchmark by adding 2**32 to all the inode numbers). I'll submit an updated patch shortly.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Tue, 06 Jul 2010 22:05:02 GMT) Full text and rfc822 format available.Message #71 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Tue, 06 Jul 2010 15:04:33 -0700
On 07/04/10 03:00, Jim Meyering wrote: > Paul, one question I have is whether to include an initial_inode_set_size > parameter in your di_set_alloc function. That would make it more > efficient in the event that an application happens to know in advance the > number of inodes it will process (say if it's processing a single device > and all of its inodes). However, there's no rush to address that -- > you're welcome to push this as-is. Thanks. Yes, the extra parameter would help for that; I had omitted it because I couldn't think of a useful real-world case. For now I pushed it without that change (please see below). This incorporates all your other suggestions, plus the performance improvement with 32-bit % that I mentioned earlier today. From f6e2e13d3fee5219a6daece59ec9cc6241a04813 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert <at> cs.ucla.edu> Date: Tue, 6 Jul 2010 14:53:14 -0700 Subject: [PATCH] du: Hash with a mechanism that's simpler and takes less memory. * gl/lib/dev-map.c, gl/lib/dev-map.h, gl/modules/dev-map: Remove. * gl/lib/ino-map.c, gl/lib/ino-map.h, gl/modules/ino-map: New files. * gl/modules/dev-map-tests, gl/tests/test-dev-map.c: Remove. * gl/modules/ino-map-tests, gl/tests/test-ino-map.c: New files. * gl/lib/di-set.h (struct di_set): Renamed from struct di_set_state, and now private. All uses changed. (_ATTRIBUTE_NONNULL_): Don't assume C99. (di_set_alloc): Renamed from di_set_init, with no size arg. Now allocates the object rather than initializing it. For now, this no longer takes an initial size; we can put this back later if it is needed. * gl/lib/di-set.c: Include hash.h, ino-map.h, and limits.h instead of stdio.h, assert.h, stdint.h, sys/types.h (di-set.h includes that now), sys/stat.h, and verify.h. (N_DEV_BITS_4, N_INO_BITS_4, N_DEV_BITS_8, N_INO_BITS_8): Remove. (struct dev_ino_4, struct dev_ino_8, struct dev_ino_full): Remove. (enum di_mode): Remove. (hashint): New typedef. (HASHINT_MAX, LARGE_INO_MIN): New macros. (struct di_ent): Now maps a dev_t to a inode set, instead of containing a union. (struct dev_map_ent): Remove. (struct di_set): New type. (is_encoded_ptr, decode_ptr, di_ent_create): Remove. (di_ent_hash, di_ent_compare, di_ent_free, di_set_alloc, di_set_free): (di_set_insert): Adjust to new representation. (di_ino_hash, map_device, map_inode_number): New functions. * gl/modules/di-set (Depends-on): Replace dev-map with ino-map. Remove 'verify'. * gl/tests/test-di-set.c: Adjust to the above changes to API. * src/du.c (INITIAL_DI_SET_SIZE): Remove. (hash_ins, main): Adjust to new di-set API. --- gl/lib/dev-map.c | 110 -------------- gl/lib/dev-map.h | 23 --- gl/lib/di-set.c | 353 ++++++++++++++++++++-------------------------- gl/lib/di-set.h | 28 +--- gl/lib/ino-map.c | 162 +++++++++++++++++++++ gl/lib/ino-map.h | 14 ++ gl/modules/dev-map | 23 --- gl/modules/dev-map-tests | 10 -- gl/modules/di-set | 3 +- gl/modules/ino-map | 24 +++ gl/modules/ino-map-tests | 10 ++ gl/tests/test-dev-map.c | 63 -------- gl/tests/test-di-set.c | 31 +---- gl/tests/test-ino-map.c | 63 ++++++++ src/du.c | 21 +-- 15 files changed, 448 insertions(+), 490 deletions(-) delete mode 100644 gl/lib/dev-map.c delete mode 100644 gl/lib/dev-map.h create mode 100644 gl/lib/ino-map.c create mode 100644 gl/lib/ino-map.h delete mode 100644 gl/modules/dev-map delete mode 100644 gl/modules/dev-map-tests create mode 100644 gl/modules/ino-map create mode 100644 gl/modules/ino-map-tests delete mode 100644 gl/tests/test-dev-map.c create mode 100644 gl/tests/test-ino-map.c diff --git a/gl/lib/dev-map.c b/gl/lib/dev-map.c deleted file mode 100644 index 363f5af..0000000 --- a/gl/lib/dev-map.c +++ /dev/null @@ -1,110 +0,0 @@ -/* Map a dev_t device number to a small integer. - - Copyright 2009-2010 Free Software Foundation, Inc. - - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 3 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program. If not, see <http://www.gnu.org/licenses/>. */ - -/* written by Jim Meyering */ - -#include <config.h> -#include "dev-map.h" - -#include <stdio.h> -#include <assert.h> -#include <stdint.h> -#include <stdlib.h> - -struct dev_map_ent -{ - dev_t dev; - uint32_t mapped_dev; -}; - -enum { INITIAL_DEV_MAP_TABLE_SIZE = 31 }; - -static size_t -dev_hash (void const *x, size_t table_size) -{ - struct dev_map_ent const *p = x; - return (uintmax_t) p->dev % table_size; -} - -/* Compare two dev_map_ent structs on "dev". - Return true if they are the same. */ -static bool -dev_compare (void const *x, void const *y) -{ - struct dev_map_ent const *a = x; - struct dev_map_ent const *b = y; - return a->dev == b->dev ? true : false; -} - -/* Initialize state. */ -int -dev_map_init (struct dev_map *dev_map) -{ - assert (dev_map != NULL); - dev_map->n_device = 0; - dev_map->dev_map = hash_initialize (INITIAL_DEV_MAP_TABLE_SIZE, NULL, - dev_hash, dev_compare, free); - return dev_map->dev_map ? 0 : -1; -} - -void -dev_map_free (struct dev_map *dev_map) -{ - hash_free (dev_map->dev_map); -} - -/* Insert device number, DEV, into the map, DEV_MAP if it's not there already, - and return the nonnegative, mapped and usually smaller, number. - If DEV is already in DEV_MAP, return the existing mapped value. - Upon memory allocation failure, return -1. */ -int -dev_map_insert (struct dev_map *dev_map, dev_t dev) -{ - /* Attempt to insert <DEV,n_device> in the map. - Possible outcomes: - - DEV was already there, with a different necessarily smaller value - - DEV was not there, (thus was just inserted) - - insert failed: out of memory - Return -1 on out of memory. - */ - - struct dev_map_ent *ent_from_table; - struct dev_map_ent *ent = malloc (sizeof *ent); - if (!ent) - return -1; - - ent->dev = dev; - ent->mapped_dev = dev_map->n_device; - - ent_from_table = hash_insert (dev_map->dev_map, ent); - if (ent_from_table == NULL) - { - /* Insertion failed due to lack of memory. */ - return -1; - } - - if (ent_from_table == ent) - { - /* Insertion succeeded: new device. */ - return dev_map->n_device++; - } - - /* That DEV value is already in the table, so ENT was not inserted. - Free it and return the already-associated value. */ - free (ent); - return ent_from_table->mapped_dev; -} diff --git a/gl/lib/dev-map.h b/gl/lib/dev-map.h deleted file mode 100644 index f093d90..0000000 --- a/gl/lib/dev-map.h +++ /dev/null @@ -1,23 +0,0 @@ -#include <stddef.h> -#include <sys/types.h> -#include <sys/stat.h> -#include "hash.h" - -struct dev_map -{ - /* KEY,VAL pair, where KEY is the raw st_dev value - and VAL is the small number that maps to. */ - struct hash_table *dev_map; - size_t n_device; -}; - -#undef _ATTRIBUTE_NONNULL_ -#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ -# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) -#else -# define _ATTRIBUTE_NONNULL_(m) -#endif - -int dev_map_init (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); -void dev_map_free (struct dev_map *) _ATTRIBUTE_NONNULL_ (1); -int dev_map_insert (struct dev_map *, dev_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index 3c4717b..e0e2b24 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -15,262 +15,221 @@ You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. */ -/* written by Jim Meyering */ +/* written by Paul Eggert and Jim Meyering */ #include <config.h> #include "di-set.h" -#include <stdio.h> -#include <assert.h> -#include <stdint.h> -#include <stdlib.h> -#include <sys/types.h> -#include <sys/stat.h> - -#include "verify.h" - -/* Set operations for device-inode pairs stored in a space-efficient manner. - A naive mapping uses 16 bytes to save a single st_dev, st_ino pair. - However, in many applications, the vast majority of actual device,inode - number pairs can be efficiently compressed to fit in 8 or even 4 bytes, - by using a separate table to map a relatively small number of devices - to small integers. */ - -#define N_DEV_BITS_4 5 -#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) - -#define N_DEV_BITS_8 8 -#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) - -/* Note how the last bit is always set. - This is required, in order to be able to distinguish - an encoded di_ent value from a malloc-returned pointer, - which must be 4-byte-aligned or better. */ -struct dev_ino_4 -{ - uint32_t mode:2; /* must be first */ - uint32_t short_ino:N_INO_BITS_4; - uint32_t mapped_dev:N_DEV_BITS_4; - uint32_t always_set:1; -}; -verify (N_DEV_BITS_4 <= 8 * sizeof (int)); -verify (sizeof (struct dev_ino_4) == 4); - -struct dev_ino_8 -{ - uint32_t mode:2; /* must be first */ - uint64_t short_ino:N_INO_BITS_8; - uint32_t mapped_dev:N_DEV_BITS_8; - uint32_t always_set:1; -}; -verify (sizeof (struct dev_ino_8) == 8); - -struct dev_ino_full -{ - uint32_t mode:2; /* must be first */ - dev_t dev; - ino_t ino; -}; +#include "hash.h" +#include "ino-map.h" -enum di_mode -{ - DI_MODE_4 = 1, - DI_MODE_8 = 2, - DI_MODE_FULL = 3 -}; +#include <limits.h> +#include <stdlib.h> -/* - di_mode raw_inode mapped dev always_set - \____________|_______________\_____/ - 4-byte | 2| 25 | 5 |1| mapped_dev - `----------------------------------------------------|-----. - 8-byte | 2| 53 | 8 |1| - `----------------------------------------------------------' -*/ +/* The hash package hashes "void *", but this package wants to hash + integers. Use integers that are as large as possible, but no + larger than void *, so that they can be cast to void * and back + without losing information. */ +typedef size_t hashint; +#define HASHINT_MAX ((hashint) -1) + +/* Integers represent inode numbers. Integers in the range + 1..(LARGE_INO_MIN-1) represent inode numbers directly. (The hash + package does not work with null pointers, so inode 0 cannot be used + as a key.) To find the representations of other inode numbers, map + them through INO_MAP. */ +#define LARGE_INO_MIN (HASHINT_MAX / 2) + +/* Set operations for device-inode pairs stored in a space-efficient + manner. Use a two-level hash table. The top level hashes by + device number, as there are typically a small number of devices. + The lower level hashes by mapped inode numbers. In the typical + case where the inode number is positive and small, the inode number + maps to itself, masquerading as a void * value; otherwise, its + value is the result of hashing the inode value through INO_MAP. */ + +/* A pair that maps a device number to a set of inode numbers. */ struct di_ent { - union - { - struct dev_ino_4 di4; - struct dev_ino_8 di8; - struct dev_ino_full full; - uint32_t u32; - uint64_t u64; - void *ptr; - } u; -}; - -struct dev_map_ent -{ dev_t dev; - uint32_t mapped_dev; + struct hash_table *ino_set; }; -static inline bool -is_encoded_ptr (struct di_ent const *v) +/* A two-level hash table that manages and indexes these pairs. */ +struct di_set { - return (size_t) v % 4; -} + /* Map device numbers to sets of inode number representatives. */ + struct hash_table *dev_map; -static struct di_ent -decode_ptr (struct di_ent const *v) -{ - if (!is_encoded_ptr (v)) - return *v; + /* If nonnull, map large inode numbers to their small + representatives. If null, there are no large inode numbers in + this set. */ + struct ino_map *ino_map; - struct di_ent di; - di.u.ptr = (void *) v; - return di; -} + /* Cache of the most recently allocated and otherwise-unused storage + for probing this table. */ + struct di_ent *probe; +}; +/* Hash a device-inode-set entry. */ static size_t di_ent_hash (void const *x, size_t table_size) { - struct di_ent e = decode_ptr (x); - return (e.u.di4.mode == DI_MODE_4 - ? e.u.u32 - : (e.u.di4.mode == DI_MODE_8 - ? e.u.u64 - : e.u.full.ino)) % table_size; + struct di_ent const *p = x; + dev_t dev = p->dev; + + /* Exclusive-OR the words of DEV into H. This avoids loss of info, + without using a wider % that could be quite slow. */ + size_t h = dev; + int i; + for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) + h ^= dev >>= CHAR_BIT * sizeof h; + + return h % table_size; } -/* Compare two di_ent structs. - Return true if they are the same. */ +/* Return true if two device-inode-set entries are the same. */ static bool di_ent_compare (void const *x, void const *y) { - struct di_ent a = decode_ptr (x); - struct di_ent b = decode_ptr (y); - if (a.u.di4.mode != b.u.di4.mode) - return false; - - if (a.u.di4.mode == DI_MODE_4) - return (a.u.di4.short_ino == b.u.di4.short_ino - && a.u.di4.mapped_dev == b.u.di4.mapped_dev); - - if (a.u.di8.mode == DI_MODE_8) - return (a.u.di8.short_ino == b.u.di8.short_ino - && a.u.di8.mapped_dev == b.u.di8.mapped_dev); - - return (a.u.full.ino == b.u.full.ino - && a.u.full.dev == b.u.full.dev); + struct di_ent const *a = x; + struct di_ent const *b = y; + return a->dev == b->dev; } +/* Free a device-inode-set entry. */ static void di_ent_free (void *v) { - if ( ! is_encoded_ptr (v)) - free (v); + struct di_ent *a = v; + hash_free (a->ino_set); + free (a); } -int -di_set_init (struct di_set_state *dis, size_t initial_size) +/* Create a set of device-inode pairs. Return NULL on allocation failure. */ +struct di_set * +di_set_alloc (void) { - if (dev_map_init (&dis->dev_map) < 0) - return -1; + struct di_set *dis = malloc (sizeof *dis); + if (dis) + { + enum { INITIAL_DEV_MAP_SIZE = 11 }; + dis->dev_map = hash_initialize (INITIAL_DEV_MAP_SIZE, NULL, + di_ent_hash, di_ent_compare, + di_ent_free); + if (! dis->dev_map) + { + free (dis); + return NULL; + } + dis->ino_map = NULL; + dis->probe = NULL; + } - dis->di_set = hash_initialize (initial_size, NULL, - di_ent_hash, di_ent_compare, di_ent_free); - return dis->di_set ? 0 : -1; + return dis; } +/* Free a set of device-inode pairs. */ void -di_set_free (struct di_set_state *dis) +di_set_free (struct di_set *dis) { - dev_map_free (&dis->dev_map); - hash_free (dis->di_set); + hash_free (dis->dev_map); + free (dis->ino_map); + free (dis->probe); + free (dis); } -/* Given a device-inode set, DIS, create an entry for the DEV,INO - pair, and store it in *V. If possible, encode DEV,INO into the pointer - itself, but if not, allocate space for a full "struct di_ent" and set *V - to that pointer. Upon memory allocation failure, return -1. - Otherwise return 0. */ -int -di_ent_create (struct di_set_state *dis, - dev_t dev, ino_t ino, - struct di_ent **v) +/* Hash an encoded inode number I. */ +static size_t +di_ino_hash (void const *i, size_t table_size) { - static int prev_m = -1; - static dev_t prev_dev = -1; - struct di_ent di_ent; - int mapped_dev; + return (hashint) i % table_size; +} - if (dev == prev_dev) - mapped_dev = prev_m; +/* Using the DIS table, map a device to a hash table that represents + a set of inode numbers. Return NULL on error. */ +static struct hash_table * +map_device (struct di_set *dis, dev_t dev) +{ + /* Find space for the probe, reusing the cache if available. */ + struct di_ent *ent; + struct di_ent *probe = dis->probe; + if (probe) + { + /* If repeating a recent query, return the cached result. */ + if (probe->dev == dev) + return probe->ino_set; + } else { - mapped_dev = dev_map_insert (&dis->dev_map, dev); - if (mapped_dev < 0) - return -1; - prev_dev = dev; - prev_m = mapped_dev; + dis->probe = probe = malloc (sizeof *probe); + if (! probe) + return NULL; } - if (mapped_dev < (1 << N_DEV_BITS_4) - && ino < (1 << N_INO_BITS_4)) + /* Probe for the device. */ + probe->dev = dev; + ent = hash_insert (dis->dev_map, probe); + if (! ent) + return NULL; + + if (ent != probe) { -#if lint - /* When this struct is smaller than a pointer, initialize - the pointer so tools like valgrind don't complain about - the uninitialized bytes. */ - if (sizeof di_ent.u.di4 < sizeof di_ent.u.ptr) - di_ent.u.ptr = NULL; -#endif - di_ent.u.di4.mode = DI_MODE_4; - di_ent.u.di4.short_ino = ino; - di_ent.u.di4.mapped_dev = mapped_dev; - di_ent.u.di4.always_set = 1; - *v = di_ent.u.ptr; + /* Use the existing entry. */ + probe->ino_set = ent->ino_set; } - else if (mapped_dev < (1 << N_DEV_BITS_8) - && ino < ((uint64_t) 1 << N_INO_BITS_8)) + else { - di_ent.u.di8.mode = DI_MODE_8; - di_ent.u.di8.short_ino = ino; - di_ent.u.di8.mapped_dev = mapped_dev; - di_ent.u.di8.always_set = 1; - *v = di_ent.u.ptr; + enum { INITIAL_INO_SET_SIZE = 1021 }; + + /* Prepare to allocate a new probe next time; this one is in use. */ + dis->probe = NULL; + + /* DEV is new; allocate an inode set for it. */ + probe->ino_set = hash_initialize (INITIAL_INO_SET_SIZE, NULL, + di_ino_hash, NULL, NULL); } - else + + return probe->ino_set; +} + +/* Using the DIS table, map an inode number to a mapped value. + Return INO_MAP_INSERT_FAILURE on error. */ +static hashint +map_inode_number (struct di_set *dis, ino_t ino) +{ + if (0 < ino && ino < LARGE_INO_MIN) + return ino; + + if (! dis->ino_map) { - /* Handle the case in which INO is too large or in which (far less - likely) we encounter hard-linked files on 2^N_DEV_BITS_8 - different devices. */ - struct di_ent *p = malloc (sizeof *p); - if (!p) - return -1; - assert ((size_t) p % 4 == 0); - p->u.full.mode = DI_MODE_FULL; - p->u.full.ino = ino; - p->u.full.dev = dev; - *v = p; + dis->ino_map = ino_map_alloc (LARGE_INO_MIN); + if (! dis->ino_map) + return INO_MAP_INSERT_FAILURE; } - return 0; + return ino_map_insert (dis->ino_map, ino); } -/* Attempt to insert the DEV,INO pair into the set, DIS. - If it matches a pair already in DIS, don't modify DIS and return 0. +/* Attempt to insert the DEV,INO pair into the set DIS. + If it matches a pair already in DIS, keep that pair and return 0. Otherwise, if insertion is successful, return 1. Upon any failure return -1. */ int -di_set_insert (struct di_set_state *dis, dev_t dev, ino_t ino) +di_set_insert (struct di_set *dis, dev_t dev, ino_t ino) { - struct di_ent *v; - if (di_ent_create (dis, dev, ino, &v) < 0) - return -1; + hashint i; - int err = hash_insert0 (dis->di_set, v, NULL); - if (err == -1) /* Insertion failed due to lack of memory. */ + /* Map the device number to a set of inodes. */ + struct hash_table *ino_set = map_device (dis, dev); + if (! ino_set) return -1; - if (err == 1) /* Insertion succeeded. */ - return 1; - - /* That pair is already in the table, so ENT was not inserted. Free it. */ - if (! is_encoded_ptr (v)) - free (v); + /* Map the inode number to a small representative I. */ + i = map_inode_number (dis, ino); + if (i == INO_MAP_INSERT_FAILURE) + return -1; - return 0; + /* Put I into the inode set. */ + return hash_insert0 (ino_set, (void *) i, NULL); } diff --git a/gl/lib/di-set.h b/gl/lib/di-set.h index f90d0dd..d30a31e 100644 --- a/gl/lib/di-set.h +++ b/gl/lib/di-set.h @@ -1,28 +1,12 @@ -#include "dev-map.h" - -struct di_set_state -{ - /* A map to help compact device numbers. */ - struct dev_map dev_map; - - /* A set of compact dev,inode pairs. */ - struct hash_table *di_set; -}; +#include <sys/types.h> #undef _ATTRIBUTE_NONNULL_ #if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ -# define _ATTRIBUTE_NONNULL_(m,...) __attribute__ ((__nonnull__ (m))) +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) #else -# define _ATTRIBUTE_NONNULL_(m,...) +# define _ATTRIBUTE_NONNULL_(m) #endif -int di_set_init (struct di_set_state *, size_t) _ATTRIBUTE_NONNULL_ (1); -void di_set_free (struct di_set_state *) _ATTRIBUTE_NONNULL_ (1); -int di_set_insert (struct di_set_state *, dev_t, ino_t) - _ATTRIBUTE_NONNULL_ (1); - -struct di_ent; -int di_ent_create (struct di_set_state *di_set_state, - dev_t dev, ino_t ino, - struct di_ent **di_ent) - _ATTRIBUTE_NONNULL_ (1,4); +struct di_set *di_set_alloc (void); +int di_set_insert (struct di_set *, dev_t, ino_t) _ATTRIBUTE_NONNULL_ (1); +void di_set_free (struct di_set *) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c new file mode 100644 index 0000000..c868983 --- /dev/null +++ b/gl/lib/ino-map.c @@ -0,0 +1,162 @@ +/* Map an ino_t inode number to a small integer. + + Copyright 2009, 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* written by Paul Eggert and Jim Meyering */ + +#include <config.h> +#include "ino-map.h" + +#include "hash.h" +#include "verify.h" + +#include <limits.h> +#include <stdlib.h> + +/* A pair that maps an inode number to a mapped inode number; the + latter is a small unique ID for the former. */ +struct ino_map_ent +{ + ino_t ino; + size_t mapped_ino; +}; + +/* A table that manages and indexes these pairs. */ +struct ino_map +{ + /* A table of KEY,VAL pairs, where KEY is the raw ino_t value and + VAL is the small number that it maps to. */ + struct hash_table *map; + + /* The next mapped inode number to hand out. */ + size_t next_mapped_ino; + + /* Cache of the most recently allocated and otherwise-unused storage + for probing the table. */ + struct ino_map_ent *probe; +}; + +/* Hash an inode map entry. */ +static size_t +ino_hash (void const *x, size_t table_size) +{ + struct ino_map_ent const *p = x; + ino_t ino = p->ino; + + /* Exclusive-OR the words of INO into H. This avoids loss of info, + without using a wider % that could be quite slow. */ + size_t h = ino; + int i; + for (i = 1; i < sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); i++) + h ^= ino >>= CHAR_BIT * sizeof h; + + return h % table_size; +} + +/* Return true if two inode map entries are the same. */ +static bool +ino_compare (void const *x, void const *y) +{ + struct ino_map_ent const *a = x; + struct ino_map_ent const *b = y; + return a->ino == b->ino; +} + +/* Allocate an inode map that will hand out integers starting with + NEXT_MAPPED_INO. Return NULL if memory is exhausted. */ +struct ino_map * +ino_map_alloc (size_t next_mapped_ino) +{ + struct ino_map *im = malloc (sizeof *im); + + if (im) + { + enum { INITIAL_INO_MAP_TABLE_SIZE = 1021 }; + im->map = hash_initialize (INITIAL_INO_MAP_TABLE_SIZE, NULL, + ino_hash, ino_compare, free); + if (! im->map) + { + free (im); + return NULL; + } + im->next_mapped_ino = next_mapped_ino; + im->probe = NULL; + } + + return im; +} + +/* Free an inode map. */ +void +ino_map_free (struct ino_map *map) +{ + hash_free (map->map); + free (map->probe); + free (map); +} + + +/* Insert into MAP the inode number INO if it's not there already, + and return its nonnegative mapped inode number. + If INO is already in MAP, return the existing mapped inode number. + Return INO_MAP_INSERT_FAILURE on memory or counter exhaustion. */ +size_t +ino_map_insert (struct ino_map *im, ino_t ino) +{ + struct ino_map_ent *ent; + + /* Find space for the probe, reusing the cache if available. */ + struct ino_map_ent *probe = im->probe; + if (probe) + { + /* If repeating a recent query, return the cached result. */ + if (probe->ino == ino) + return probe->mapped_ino; + } + else + { + im->probe = probe = malloc (sizeof *probe); + if (! probe) + return INO_MAP_INSERT_FAILURE; + } + + probe->ino = ino; + ent = hash_insert (im->map, probe); + if (! ent) + return INO_MAP_INSERT_FAILURE; + + if (ent != probe) + { + /* Use the existing entry. */ + probe->mapped_ino = ent->mapped_ino; + } + else + { + /* If adding 1 to map->next_mapped_ino would cause it to + overflow to zero, then it must equal INO_MAP_INSERT_FAILURE, + which is the value that should be returned in that case. + Verify that this works. */ + verify (INO_MAP_INSERT_FAILURE + 1 == 0); + + /* Prepare to allocate a new probe next time; this one is in use. */ + im->probe = NULL; + + /* INO is new; allocate a mapped inode number for it. */ + probe->mapped_ino = im->next_mapped_ino++; + } + + return probe->mapped_ino; +} diff --git a/gl/lib/ino-map.h b/gl/lib/ino-map.h new file mode 100644 index 0000000..661b78a --- /dev/null +++ b/gl/lib/ino-map.h @@ -0,0 +1,14 @@ +#include <sys/types.h> + +#undef _ATTRIBUTE_NONNULL_ +#if __GNUC__ == 3 && __GNUC_MINOR__ >= 3 || 3 < __GNUC__ +# define _ATTRIBUTE_NONNULL_(m) __attribute__ ((__nonnull__ (m))) +#else +# define _ATTRIBUTE_NONNULL_(m) +#endif + +#define INO_MAP_INSERT_FAILURE ((size_t) -1) + +struct ino_map *ino_map_alloc (size_t); +void ino_map_free (struct ino_map *) _ATTRIBUTE_NONNULL_ (1); +size_t ino_map_insert (struct ino_map *, ino_t) _ATTRIBUTE_NONNULL_ (1); diff --git a/gl/modules/dev-map b/gl/modules/dev-map deleted file mode 100644 index 91f437b..0000000 --- a/gl/modules/dev-map +++ /dev/null @@ -1,23 +0,0 @@ -Description: -maintain a mapping of dev_t numbers to small integers - -Files: -lib/dev-map.c -lib/dev-map.h - -Depends-on: -hash - -configure.ac: - -Makefile.am: -lib_SOURCES += dev-map.c dev-map.h - -Include: -"dev-map.h" - -License -GPL - -Maintainer: -Jim Meyering diff --git a/gl/modules/dev-map-tests b/gl/modules/dev-map-tests deleted file mode 100644 index 4bec2e6..0000000 --- a/gl/modules/dev-map-tests +++ /dev/null @@ -1,10 +0,0 @@ -Files: -tests/test-dev-map.c - -Depends-on: - -configure.ac: - -Makefile.am: -TESTS += test-dev-map -check_PROGRAMS += test-dev-map diff --git a/gl/modules/di-set b/gl/modules/di-set index fe52778..562db14 100644 --- a/gl/modules/di-set +++ b/gl/modules/di-set @@ -6,9 +6,8 @@ lib/di-set.c lib/di-set.h Depends-on: -dev-map +ino-map hash -verify configure.ac: diff --git a/gl/modules/ino-map b/gl/modules/ino-map new file mode 100644 index 0000000..06ee51d --- /dev/null +++ b/gl/modules/ino-map @@ -0,0 +1,24 @@ +Description: +maintain a mapping of ino_t numbers to small integers + +Files: +lib/ino-map.c +lib/ino-map.h + +Depends-on: +hash +verify + +configure.ac: + +Makefile.am: +lib_SOURCES += ino-map.c ino-map.h + +Include: +"ino-map.h" + +License +GPL + +Maintainer: +Jim Meyering diff --git a/gl/modules/ino-map-tests b/gl/modules/ino-map-tests new file mode 100644 index 0000000..fa10b4f --- /dev/null +++ b/gl/modules/ino-map-tests @@ -0,0 +1,10 @@ +Files: +tests/test-ino-map.c + +Depends-on: + +configure.ac: + +Makefile.am: +TESTS += test-ino-map +check_PROGRAMS += test-ino-map diff --git a/gl/tests/test-dev-map.c b/gl/tests/test-dev-map.c deleted file mode 100644 index 98ba630..0000000 --- a/gl/tests/test-dev-map.c +++ /dev/null @@ -1,63 +0,0 @@ -/* Test the dev-map module. - Copyright (C) 2010 Free Software Foundation, Inc. - - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 3 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program. If not, see <http://www.gnu.org/licenses/>. */ - -/* Written by Jim Meyering. */ - -#include <config.h> -#include <stdlib.h> -#include <stdio.h> -#include <string.h> - -/* FIXME: once/if in gnulib, use #include "macros.h" in place of this */ -#define ASSERT(expr) \ - do \ - { \ - if (!(expr)) \ - { \ - fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ - fflush (stderr); \ - abort (); \ - } \ - } \ - while (0) - -#include "dev-map.h" - -/* Risky: this is also defined in di-set.c; they should match. */ -#define N_DEV_BITS_4 5 - -int -main () -{ - /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ - struct dev_map dev_map; - ASSERT (dev_map_init (&dev_map) == 0); - - ASSERT (dev_map_insert (&dev_map, 42) == 0); - ASSERT (dev_map_insert (&dev_map, 42) == 0); - ASSERT (dev_map_insert (&dev_map, 398) == 1); - ASSERT (dev_map_insert (&dev_map, 398) == 1); - - int32_t i; - for (i = 0; i < (1 << N_DEV_BITS_4); i++) - { - ASSERT (dev_map_insert (&dev_map, 10000+i) == 2+i); - } - - dev_map_free (&dev_map); - - return 0; -} diff --git a/gl/tests/test-di-set.c b/gl/tests/test-di-set.c index 4d5823e..e5fb6cb 100644 --- a/gl/tests/test-di-set.c +++ b/gl/tests/test-di-set.c @@ -1,4 +1,4 @@ -/* Test the dev-map module. +/* Test the di-set module. Copyright (C) 2010 Free Software Foundation, Inc. This program is free software: you can redistribute it and/or modify @@ -36,41 +36,21 @@ #include "di-set.h" -/* FIXME: ugly duplication of code from di-set.c */ -#define N_DEV_BITS_4 5 -#define N_INO_BITS_4 (32 - N_DEV_BITS_4 - 2 - 1) - -#define N_DEV_BITS_8 8 -#define N_INO_BITS_8 (64 - N_DEV_BITS_8 - 2 - 1) - int main (void) { /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ - size_t initial_size = 61; - /* "real" code might prefer to avoid the allocation here, simply - declaring "struct di_set_state dis;", do a global substitution, - s/\<dis\>/\&dis/, and remove the final free. */ - struct di_set_state *dis = malloc (sizeof *dis); + struct di_set *dis = di_set_alloc (); ASSERT (dis); - ASSERT (di_set_init (dis, initial_size) == 0); - - struct di_ent *di_ent; - ASSERT (di_ent_create (dis, 1, 1, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1 << N_DEV_BITS_4, 1, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1, 1 << N_INO_BITS_4, &di_ent) == 0); - ASSERT (di_ent_create (dis, 1, - (uint64_t) 1 << N_INO_BITS_8, &di_ent) == 0); - free (di_ent); ASSERT (di_set_insert (dis, 2, 5) == 1); /* first insertion succeeds */ ASSERT (di_set_insert (dis, 2, 5) == 0); /* duplicate fails */ ASSERT (di_set_insert (dis, 3, 5) == 1); /* diff dev, duplicate inode is ok */ ASSERT (di_set_insert (dis, 2, 8) == 1); /* same dev, different inode is ok */ - /* very large inode number */ - ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 1); - ASSERT (di_set_insert (dis, 5, (uint64_t) 1 << 63) == 0); /* dup */ + /* very large (or negative) inode number */ + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 1); + ASSERT (di_set_insert (dis, 5, (ino_t) -1) == 0); /* dup */ unsigned int i; for (i = 0; i < 3000; i++) @@ -79,7 +59,6 @@ main (void) ASSERT (di_set_insert (dis, 9, i) == 0); /* duplicate fails */ di_set_free (dis); - free (dis); return 0; } diff --git a/gl/tests/test-ino-map.c b/gl/tests/test-ino-map.c new file mode 100644 index 0000000..2b44602 --- /dev/null +++ b/gl/tests/test-ino-map.c @@ -0,0 +1,63 @@ +/* Test the ino-map module. + Copyright (C) 2010 Free Software Foundation, Inc. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. */ + +/* Written by Jim Meyering. */ + +#include <config.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +/* FIXME: once/if in gnulib, use #include "macros.h" in place of this */ +#define ASSERT(expr) \ + do \ + { \ + if (!(expr)) \ + { \ + fprintf (stderr, "%s:%d: assertion failed\n", __FILE__, __LINE__); \ + fflush (stderr); \ + abort (); \ + } \ + } \ + while (0) + +#include "ino-map.h" + +int +main () +{ + /* set_program_name (argv[0]); placate overzealous "syntax-check" test. */ + enum { INO_MAP_INIT = 123 }; + struct ino_map *ino_map = ino_map_alloc (INO_MAP_INIT); + ASSERT (ino_map != NULL); + + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 42) == INO_MAP_INIT); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 398) == INO_MAP_INIT + 1); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + ASSERT (ino_map_insert (ino_map, 0) == INO_MAP_INIT + 2); + + int i; + for (i = 0; i < 100; i++) + { + ASSERT (ino_map_insert (ino_map, 10000 + i) == INO_MAP_INIT + 3 + i); + } + + ino_map_free (ino_map); + + return 0; +} diff --git a/src/du.c b/src/du.c index 7c02c86..739be73 100644 --- a/src/du.c +++ b/src/du.c @@ -60,11 +60,8 @@ extern bool fts_debug; # define FTS_CROSS_CHECK(Fts) #endif -/* Initial size of the hash table. */ -enum { INITIAL_DI_SET_SIZE = 1021 }; - /* A set of dev/ino pairs. */ -static struct di_set_state di_set; +static struct di_set *di_set; /* Define a class for collecting directory information. */ @@ -337,14 +334,10 @@ Mandatory arguments to long options are mandatory for short options too.\n\ static bool hash_ins (ino_t ino, dev_t dev) { - int inserted = di_set_insert (&di_set, dev, ino); + int inserted = di_set_insert (di_set, dev, ino); if (inserted < 0) - { - /* Insertion failed due to lack of memory. */ - xalloc_die (); - } - - return inserted ? true : false; + xalloc_die (); + return inserted; } /* FIXME: this code is nearly identical to code in date.c */ @@ -905,7 +898,8 @@ main (int argc, char **argv) xalloc_die (); /* Initialize the set of dev,inode pairs. */ - if (di_set_init (&di_set, INITIAL_DI_SET_SIZE)) + di_set = di_set_alloc (); + if (!di_set) xalloc_die (); bit_flags |= symlink_deref_bits; @@ -977,6 +971,7 @@ main (int argc, char **argv) } argv_iter_free (ai); + di_set_free (di_set); if (files_from && (ferror (stdin) || fclose (stdin) != 0)) error (EXIT_FAILURE, 0, _("error reading %s"), quote (files_from)); @@ -984,7 +979,5 @@ main (int argc, char **argv) if (print_grand_total) print_size (&tot_dui, _("total")); - di_set_free (&di_set); - exit (ok ? EXIT_SUCCESS : EXIT_FAILURE); } -- 1.7.0.4
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Tue, 06 Jul 2010 22:19:02 GMT) Full text and rfc822 format available.Message #74 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Wed, 07 Jul 2010 00:18:17 +0200
Paul Eggert wrote: > On 07/04/10 03:00, Jim Meyering wrote: > >> Paul, one question I have is whether to include an initial_inode_set_size >> parameter in your di_set_alloc function. That would make it more >> efficient in the event that an application happens to know in advance the >> number of inodes it will process (say if it's processing a single device >> and all of its inodes). However, there's no rush to address that -- >> you're welcome to push this as-is. > > Thanks. Yes, the extra parameter would help for that; I had omitted > it because I couldn't think of a useful real-world case. For now I pushed > it without that change (please see below). This incorporates all your other > suggestions, plus the performance improvement with 32-bit % that I mentioned > earlier today. > >>From f6e2e13d3fee5219a6daece59ec9cc6241a04813 Mon Sep 17 00:00:00 2001 > From: Paul Eggert <eggert <at> cs.ucla.edu> > Date: Tue, 6 Jul 2010 14:53:14 -0700 > Subject: [PATCH] du: Hash with a mechanism that's simpler and takes less memory. Thanks! A minor nit for next time: du: Hash with a mechanism that's simpler and takes less memory. please do not capitalize and omit the period on the one-line summary One more thing: It fails to compile on x86_64 with -Werror: cc1: warnings being treated as errors di-set.c: In function 'di_ent_hash': di-set.c:86: error: right shift count >= width of type make[4]: *** [di-set.o] Error 1
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Tue, 06 Jul 2010 23:23:01 GMT) Full text and rfc822 format available.Message #77 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Tue, 06 Jul 2010 16:22:50 -0700
On 07/06/10 15:18, Jim Meyering wrote: > A minor nit for next time: > du: Hash with a mechanism that's simpler and takes less memory. > please do not capitalize and omit the period on the one-line summary Sorry, I'll try to remember better next time. > It fails to compile on x86_64 with -Werror: ... > di-set.c:86: error: right shift count >= width of type That's an incorrect warning, since the code is unreachable on that platform and the compiler should know that it's unreachable. This bug has been present in GCC for ages, with no signs of it ever getting fixed. In this particular case it's fairly easy to work around with no runtime penalty assuming reasonable optimization (except perhaps on weird hosts where sizeof (ino_t) > 2 * sizeof (size_t)) so I installed this further patch: From db4df7dda6e209e3e38fe69298624ffe92d392c7 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert <at> cs.ucla.edu> Date: Tue, 6 Jul 2010 16:16:20 -0700 Subject: [PATCH] du: avoid spurious warnings with 64-bit gcc -W Problem reported by Jim Meyering in: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6524#74 * gl/lib/di-set.c (di_ent_hash): Rework so that the compiler does not incorrectly warn about shifting by 64-bits in unreachable code. * gl/lib/ino-map.c (ino_hash): Likewise. --- gl/lib/di-set.c | 2 +- gl/lib/ino-map.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index e0e2b24..ba44bcf 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -83,7 +83,7 @@ di_ent_hash (void const *x, size_t table_size) size_t h = dev; int i; for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) - h ^= dev >>= CHAR_BIT * sizeof h; + h ^= dev >> CHAR_BIT * sizeof h * i; return h % table_size; } diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c index c868983..cc9a131 100644 --- a/gl/lib/ino-map.c +++ b/gl/lib/ino-map.c @@ -61,7 +61,7 @@ ino_hash (void const *x, size_t table_size) size_t h = ino; int i; for (i = 1; i < sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); i++) - h ^= ino >>= CHAR_BIT * sizeof h; + h ^= ino >> CHAR_BIT * sizeof h * i; return h % table_size; } -- 1.7.0.4
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Wed, 07 Jul 2010 10:05:01 GMT) Full text and rfc822 format available.Message #80 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Wed, 07 Jul 2010 12:03:58 +0200
Paul Eggert wrote: ... >> It fails to compile on x86_64 with -Werror: ... >> di-set.c:86: error: right shift count >= width of type > > That's an incorrect warning, since the code is unreachable on > that platform and the compiler should know that it's unreachable. > This bug has been present in GCC for ages, with no signs of it > ever getting fixed. In this particular case it's fairly easy > to work around with no runtime penalty assuming reasonable > optimization (except perhaps on weird hosts where > sizeof (ino_t) > 2 * sizeof (size_t)) so I installed > this further patch: > > Subject: [PATCH] du: avoid spurious warnings with 64-bit gcc -W > > Problem reported by Jim Meyering in: > http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6524#74 > * gl/lib/di-set.c (di_ent_hash): Rework so that the compiler does > not incorrectly warn about shifting by 64-bits in unreachable code. > * gl/lib/ino-map.c (ino_hash): Likewise. > --- > gl/lib/di-set.c | 2 +- > gl/lib/ino-map.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c > index e0e2b24..ba44bcf 100644 > --- a/gl/lib/di-set.c > +++ b/gl/lib/di-set.c > @@ -83,7 +83,7 @@ di_ent_hash (void const *x, size_t table_size) > size_t h = dev; > int i; > for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) > - h ^= dev >>= CHAR_BIT * sizeof h; > + h ^= dev >> CHAR_BIT * sizeof h * i; Good one. Thanks for the quick fix. I prefer to use unsigned types when appropriate, and adjusted a couple of other things for readability. I'll push something like the following later today. Do you know of any system on which sizeof dev_t is larger than two times sizeof size_t? I was wondering if the added (sizeof dev % sizeof h != 0) term is actually necessary anywhere. The largest dev_t I've seen is 8 bytes wide. Just curious. I know it is required, just in case. From 52df9e9f77b6cef49f8c42ee1d17bc0fe5448ff8 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Wed, 7 Jul 2010 11:57:26 +0200 Subject: [PATCH] di-set, ino-map: adjust a type, improve readability * gl/lib/ino-map.c (ino_hash): Declare "i" as unsigned int. Use an intermediate variable for the for-loop upper bound, so it's a little more readable. Adjust comment. * gl/lib/di-set.c (di_ent_hash): Likewise. --- gl/lib/di-set.c | 10 ++++++---- gl/lib/ino-map.c | 10 ++++++---- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index ba44bcf..231c636 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -78,11 +78,13 @@ di_ent_hash (void const *x, size_t table_size) struct di_ent const *p = x; dev_t dev = p->dev; - /* Exclusive-OR the words of DEV into H. This avoids loss of info, - without using a wider % that could be quite slow. */ + /* When DEV is wider than size_t, exclusive-OR the words of DEV into H. + This avoids loss of info, without applying % to the wider type, + which could be quite slow on some systems. */ size_t h = dev; - int i; - for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) + unsigned int i; + unsigned int n_words = sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); + for (i = 1; i < n_words; i++) h ^= dev >> CHAR_BIT * sizeof h * i; return h % table_size; diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c index cc9a131..f6fdd63 100644 --- a/gl/lib/ino-map.c +++ b/gl/lib/ino-map.c @@ -56,11 +56,13 @@ ino_hash (void const *x, size_t table_size) struct ino_map_ent const *p = x; ino_t ino = p->ino; - /* Exclusive-OR the words of INO into H. This avoids loss of info, - without using a wider % that could be quite slow. */ + /* When INO is wider than size_t, exclusive-OR the words of INO into H. + This avoids loss of info, without applying % to the wider type, + which could be quite slow on some systems. */ size_t h = ino; - int i; - for (i = 1; i < sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); i++) + unsigned int i; + unsigned int n_words = sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); + for (i = 1; i < n_words; i++) h ^= ino >> CHAR_BIT * sizeof h * i; return h % table_size; -- 1.7.2.rc1.192.g262ff
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Wed, 07 Jul 2010 12:51:02 GMT) Full text and rfc822 format available.Message #83 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Paul Eggert <eggert <at> CS.UCLA.EDU> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Wed, 07 Jul 2010 14:49:49 +0200
Jim Meyering wrote: ... > I prefer to use unsigned types when appropriate, > and adjusted a couple of other things for readability. > I'll push something like the following later today. ... > Subject: [PATCH] di-set, ino-map: adjust a type, improve readability ... > size_t h = dev; > - int i; > - for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) > + unsigned int i; > + unsigned int n_words = sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); It'd be good if I were to compile before posting a patch ;-) Those uses of "ino" should be "dev", of course. From 8217117f3cb4ee39f8913fcfdef8b3d64dfd2437 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyering <at> redhat.com> Date: Wed, 7 Jul 2010 11:57:26 +0200 Subject: [PATCH] di-set, ino-map: adjust a type, improve readability * gl/lib/ino-map.c (ino_hash): Declare "i" as unsigned int. Use an intermediate variable for the for-loop upper bound, so it's a little more readable. Adjust comment. * gl/lib/di-set.c (di_ent_hash): Likewise. --- gl/lib/di-set.c | 10 ++++++---- gl/lib/ino-map.c | 10 ++++++---- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/gl/lib/di-set.c b/gl/lib/di-set.c index ba44bcf..892950d 100644 --- a/gl/lib/di-set.c +++ b/gl/lib/di-set.c @@ -78,11 +78,13 @@ di_ent_hash (void const *x, size_t table_size) struct di_ent const *p = x; dev_t dev = p->dev; - /* Exclusive-OR the words of DEV into H. This avoids loss of info, - without using a wider % that could be quite slow. */ + /* When DEV is wider than size_t, exclusive-OR the words of DEV into H. + This avoids loss of info, without applying % to the wider type, + which could be quite slow on some systems. */ size_t h = dev; - int i; - for (i = 1; i < sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); i++) + unsigned int i; + unsigned int n_words = sizeof dev / sizeof h + (sizeof dev % sizeof h != 0); + for (i = 1; i < n_words; i++) h ^= dev >> CHAR_BIT * sizeof h * i; return h % table_size; diff --git a/gl/lib/ino-map.c b/gl/lib/ino-map.c index cc9a131..f6fdd63 100644 --- a/gl/lib/ino-map.c +++ b/gl/lib/ino-map.c @@ -56,11 +56,13 @@ ino_hash (void const *x, size_t table_size) struct ino_map_ent const *p = x; ino_t ino = p->ino; - /* Exclusive-OR the words of INO into H. This avoids loss of info, - without using a wider % that could be quite slow. */ + /* When INO is wider than size_t, exclusive-OR the words of INO into H. + This avoids loss of info, without applying % to the wider type, + which could be quite slow on some systems. */ size_t h = ino; - int i; - for (i = 1; i < sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); i++) + unsigned int i; + unsigned int n_words = sizeof ino / sizeof h + (sizeof ino % sizeof h != 0); + for (i = 1; i < n_words; i++) h ^= ino >> CHAR_BIT * sizeof h * i; return h % table_size; -- 1.7.2.rc1.218.gca56a
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Thu, 08 Jul 2010 18:19:02 GMT) Full text and rfc822 format available.Message #86 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Thu, 08 Jul 2010 11:18:25 -0700
On 07/07/10 03:03, Jim Meyering wrote: > Do you know of any system on which sizeof dev_t is larger > than two times sizeof size_t? I was wondering if the added > (sizeof dev % sizeof h != 0) term is actually necessary anywhere. > The largest dev_t I've seen is 8 bytes wide. Just curious. > I know it is required, just in case. I don't know of any, no. I imagine there may be some hosts that have 16-bit size_t and 64-bit ino_t, but the GNU coding standards say that we don't have to worry about such hosts. The general-case code is mainly there to document the general case and to make it clear what the intent is: I can't imagine any host where the (sizeof dev % sizeof h != 0) term is actually needed. It's just a hash function, so the code will still work even if the loop is removed completely; but it won't hash nearly as well on Cygwin, I expect.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Thu, 08 Jul 2010 18:21:02 GMT) Full text and rfc822 format available.Message #89 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> CS.UCLA.EDU> To: Jim Meyering <jim <at> meyering.net> Cc: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Thu, 08 Jul 2010 11:20:30 -0700
On 07/07/10 03:03, Jim Meyering wrote: > I prefer to use unsigned types when appropriate, > and adjusted a couple of other things for readability. Thanks, that change looks good of course (after the later replacement of ino with dev). I'll try to keep the preference for unsigned in mind in later patches.
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:bug#6524
; Package coreutils
.
(Mon, 12 Jul 2010 13:07:02 GMT) Full text and rfc822 format available.Message #92 received at 6524 <at> debbugs.gnu.org (full text, mbox):
From: Pádraig Brady <P <at> draigBrady.com> To: 6524 <at> debbugs.gnu.org Subject: Re: bug#6524: du now uses less than half as much memory, sometimes Date: Mon, 12 Jul 2010 14:05:51 +0100
On 27/06/10 23:09, Jim Meyering wrote: > FYI, > * NEWS (New features): Mention it. Just a minor point. I'll move this NEWS to the "Changes in behavior" section, since there are no associated interface changes. I'm a little more undecided about where I'm going to document the concurrent sort performance enhancements. I'm leaning towards a "change in behavior" even though there is a new --parallel option for edge cases. In that way one can quickly scan that section to determine how upgrading may affect the system. cheers, Pádraig.
Jim Meyering <jim <at> meyering.net>
:Jim Meyering <jim <at> meyering.net>
:Message #97 received at 6524-done <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net> To: Pádraig Brady <P <at> draigBrady.com> Cc: Paul Eggert <eggert <at> CS.UCLA.EDU>, 6524-done <at> debbugs.gnu.org Subject: Re: [PATCH] hash: extend module to deal with non-pointer keys Date: Sun, 17 Apr 2011 11:08:45 +0200
Jim Meyering wrote: > Pádraig Brady wrote: >> On 02/07/10 20:17, Jim Meyering wrote: >>> Thanks again for the report. >>> However, while I was able to reproduce it (on Paul's system) >>> and debug it, it appears to be due to a miscompilation of di-set.o >>> when using a private copy of gcc-4.5.0. When I recompiled >>> that one file with gcc-Ubuntu 4.4.3-4ubuntu5 and -g -O2 >>> or with 4.5.0 and -g -O, the code works as expected. >> >> I used a private gcc 4.5.0 also, and switched to >> gcc 4.4.1 for debugging where I couldn't reproduce it >> (gcc 4.5.0 produces debugging info with incompatible >> line number info for my gdb (6.8.50)). > > I reported the code gen bug (i686-specific), and learned that > it's fixed on the trunk and for the upcoming 4.5.1: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44806 Paul's change was pushed. Closing.
Debbugs Internal Request <help-debbugs <at> gnu.org>
to internal_control <at> debbugs.gnu.org
.
(Sun, 15 May 2011 11:24:04 GMT) Full text and rfc822 format available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.