GNU bug report logs - #7597
multi-threaded sort can segfault (unrelated to the sort -u segfault)

Previous Next

Package: coreutils;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Thu, 9 Dec 2010 12:11:01 UTC

Severity: normal

Tags: fixed

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 7597 in the body.
You can then email your comments to 7597 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Thu, 09 Dec 2010 12:11:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 09 Dec 2010 12:11:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Chen Guo <chen.guo.0625 <at> gmail.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: multi-threaded sort can segfault (unrelated to the sort -u segfault)
Date: Thu, 09 Dec 2010 12:31:14 +0100
[I've retitled and Cc'd bug-coreutils so this gets its own bug number]

Chen Guo wrote:
> On Tue, Dec 7, 2010 at 3:24 AM, Jim Meyering <jim <at> meyering.net> wrote:
>> Thanks.  What does your input file look like?
>> I've been unable to reproduce the failure using the output of
>> seq 1000000.  I've tried a few different -S ... options, in case
>> the amount of available memory is a factor:
>>
>>  seq 1000000 > in-1M
>>  for i in $(seq 1000); do \
>>    printf '%03d\r' $i; src/sort --parallel=2 -S 1M in-1M > /dev/null; done
>
> The input file I used was generated with gensort
> (http://www.ordinal.com/gensort.html)
> I used the -a 1000000 to generate a 1 million line ASCII file. Running
> sort 10 times on 2 or 4 threads almost always triggered at least 1
> segfault.

Thank you.
With that, I've solved at least part of the problem.
The segfault (and other strangeness we've witnessed)
arises because each "node" struct is stored on the stack,
and its address ends up being used by another thread after
the thread that owns the stack in question has been "joined".

My solution is to use the heap instead of the stack.
However, for today I'm out of time and I have not yet found a
way to free these newly-malloc'd "node" buffers.

To test this, I've done the following:

gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; valgrind --quiet src/sort -S 100K \
  --parallel=2 gensort-10k > k; test $(wc -c < k) = 1000000 || break; done
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
  --parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done

Without the patch, the first would show errors for more than 50% of
the runs and the second would rarely get to i=100 without generating
a core file.  With the patch, both complete error-free (not counting
leaks).

I'll add tests and a NEWS item, eventually.

From e3a0cb08ebcb7ad6638d022418488c1085838fc3 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> redhat.com>
Date: Thu, 9 Dec 2010 12:12:53 +0100
Subject: [PATCH] sort: avoid segfault when using two or more threads

* src/sort.c (sortlines, sort): Store each "node" structure on
the heap, not on the stack.  Otherwise, a node from one thread's
stack could be used in another thread after the first thread had
expired (via pthread_join).
---
 src/sort.c |   56 ++++++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 742971e..9a7ecd4 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3471,28 +3471,28 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
   struct line *hi = lo - nlo;
   struct line **parent_end = lo_child ? &parent->end_lo : &parent->end_hi;

-  struct merge_node node;
-  node.lo = node.end_lo = lo;
-  node.hi = node.end_hi = hi;
-  node.dest = parent_end;
-  node.nlo = nlo;
-  node.nhi = nhi;
-  node.parent = parent;
-  node.level = parent->level + 1;
-  node.queued = false;
-  pthread_mutex_init (&node.lock, NULL);
+  struct merge_node *node = xmalloc (sizeof *node);
+  node->lo = node->end_lo = lo;
+  node->hi = node->end_hi = hi;
+  node->dest = parent_end;
+  node->nlo = nlo;
+  node->nhi = nhi;
+  node->parent = parent;
+  node->level = parent->level + 1;
+  node->queued = false;
+  pthread_mutex_init (&node->lock, NULL);

   /* Calculate thread arguments. */
   unsigned long int lo_threads = nthreads / 2;
   unsigned long int hi_threads = nthreads - lo_threads;
   pthread_t thread;
-  struct thread_args args = {lines, lo, lo_threads, total_lines, &node,
+  struct thread_args args = {lines, lo, lo_threads, total_lines, node,
                              true, queue, tfp, temp_output};

   if (nthreads > 1 && SUBTHREAD_LINES_HEURISTIC <= nlines
       && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
     {
-      sortlines (lines - nlo, hi, hi_threads, total_lines, &node, false,
+      sortlines (lines - nlo, hi, hi_threads, total_lines, node, false,
                  queue, tfp, temp_output);
       pthread_join (thread, NULL);
     }
@@ -3507,16 +3507,16 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
         sequential_sort (lines, nlo, temp, false);

       /* Update merge NODE. No need to lock yet. */
-      node.lo = lines;
-      node.hi = lines - nlo;
-      node.end_lo = lines - nlo;
-      node.end_hi = lines - nlo - nhi;
+      node->lo = lines;
+      node->hi = lines - nlo;
+      node->end_lo = lines - nlo;
+      node->end_hi = lines - nlo - nhi;

-      queue_insert (queue, &node);
+      queue_insert (queue, node);
       merge_loop (queue, total_lines, tfp, temp_output);
     }

-  pthread_mutex_destroy (&node.lock);
+  pthread_mutex_destroy (&node->lock);
 }

 /* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3793,19 +3793,19 @@ sort (char *const *files, size_t nfiles, char const *output_file,
               struct merge_node_queue queue;
               queue_init (&queue, 2 * nthreads);

-              struct merge_node node;
-              node.lo = node.hi = node.end_lo = node.end_hi = NULL;
-              node.dest = NULL;
-              node.nlo = node.nhi = buf.nlines;
-              node.parent = NULL;
-              node.level = MERGE_END;
-              node.queued = false;
-              pthread_mutex_init (&node.lock, NULL);
+              struct merge_node *node = xmalloc (sizeof *node);
+              node->lo = node->hi = node->end_lo = node->end_hi = NULL;
+              node->dest = NULL;
+              node->nlo = node->nhi = buf.nlines;
+              node->parent = NULL;
+              node->level = MERGE_END;
+              node->queued = false;
+              pthread_mutex_init (&node->lock, NULL);

-              sortlines (line, line, nthreads, buf.nlines, &node, true,
+              sortlines (line, line, nthreads, buf.nlines, node, true,
                          &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_mutex_destroy (&node.lock);
+              pthread_mutex_destroy (&node->lock);
             }
           else
             write_unique (line - 1, tfp, temp_output);
--
1.7.3.3




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Thu, 09 Dec 2010 21:28:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Chen Guo <chen.guo.0625 <at> gmail.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: [coreutils] multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Thu, 09 Dec 2010 22:33:42 +0100
Jim Meyering wrote:
...
> With that, I've solved at least part of the problem.
> The segfault (and other strangeness we've witnessed)
> arises because each "node" struct is stored on the stack,
> and its address ends up being used by another thread after
> the thread that owns the stack in question has been "joined".
>
> My solution is to use the heap instead of the stack.
> However, for today I'm out of time and I have not yet found a
> way to free these newly-malloc'd "node" buffers.
>
> To test this, I've done the following:
>
> gensort -a 10000 > gensort-10k
> for i in $(seq 2000); do printf '% 4d\n' $i; valgrind --quiet src/sort -S 100K \
>   --parallel=2 gensort-10k > k; test $(wc -c < k) = 1000000 || break; done
> for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
>   --parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done
>
> Without the patch, the first would show errors for more than 50% of
> the runs and the second would rarely get to i=100 without generating
> a core file.  With the patch, both complete error-free (not counting
> leaks).

FYI, while preparing a test, I've found that the latter test
(without valgrind) passes 2000/2000 tests when compiled with -g -O2,
yet fails in at least 10 of the 2000 when compiled with -ggdb3.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Fri, 10 Dec 2010 04:01:02 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: multi-threaded sort can segfault (unrelated to the sort -u
	segfault)
Date: Thu, 09 Dec 2010 10:26:28 -0800
On 12/09/10 03:31, Jim Meyering wrote:
> The segfault (and other strangeness we've witnessed)
> arises because each "node" struct is stored on the stack,
> and its address ends up being used by another thread after
> the thread that owns the stack in question has been "joined".

Ah, of *course*!

> My solution is to use the heap instead of the stack.

This can be simplified by allocating all the nodes at once,
since the number of nodes is bounded above by the number
of threads.  I'll take a look at this, if nobody else gets
to it first.

I am also still looking at the problem with the compression/decompression
hang.  The code to do subprocesses is busted in more than
one way: it does not always catch failures (nonzero exit status)
in decompression, and it can falsely report
failure even when compression and decompression are
both perfectly OK.  However, I still don't have a handle
on the actual bug that causes the hang.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Fri, 10 Dec 2010 21:18:02 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Chen Guo <chen.guo.0625 <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: coreutils <at> gnu.org, bug-coreutils <at> gnu.org, Jim Meyering <jim <at> meyering.net>,
	DJ Lucas <dj <at> linuxfromscratch.org>
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the sort
	-u segfault)
Date: Fri, 10 Dec 2010 13:23:10 -0800
Hi Professor Eggert,
On Thu, Dec 9, 2010 at 10:26 AM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> This can be simplified by allocating all the nodes at once,
> since the number of nodes is bounded above by the number
> of threads.  I'll take a look at this, if nobody else gets
> to it first.

Got to it first :-)
This approach has the side benefit of letting me refactor that whole
node creation block to its own function, and thus cleaning up
sortlines quite a bit. Over 60 runs each on 2 and 4 threads, I am no
longer seeing any segfaults.

From 837b4ea56ba32220c5c09f21a8ab59e7c71824a9 Mon Sep 17 00:00:00 2001
From: Chen Guo <chenguo4 <at> ucla.edu>
Date: Fri, 10 Dec 2010 13:13:36 -0800
Subject: [PATCH] sort: preallocate merge tree nodes to heap.

* src/sort.c: (merge_tree_init) New function. Allocates memory for
merge tree nodes.
(merge_tree_destory) New function.
(init_node) New function.
(sortlines) Refactor node creation code to init_node. Remove now
superfluous arguments. All callers changed.
(sort) Initialize/destory merge tree. Refactor root node creation
to merge_tree_init.
---
 src/sort.c |  170 +++++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 111 insertions(+), 59 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9cfc814..165d8e8 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -239,6 +239,8 @@ struct merge_node
   size_t nlo;                   /* Total Lines remaining from LO. */
   size_t nhi;                   /* Total lines remaining from HI. */
   struct merge_node *parent;    /* Parent node. */
+  struct merge_node *lo_child;  /* LO child node. */
+  struct merge_node *hi_child;  /* HI child node. */
   unsigned int level;           /* Level in merge tree. */
   bool queued;                  /* Node is already in heap. */
   pthread_mutex_t lock;         /* Lock for node operations. */
@@ -3137,6 +3139,83 @@ sequential_sort (struct line *restrict lines,
size_t nlines,
     }
 }

+struct merge_node * init_node (struct merge_node *, struct merge_node *,
+                               struct line *restrict, unsigned long int,
+                               size_t, bool);
+
+
+/* Initialize the merge tree. */
+static struct merge_node *
+merge_tree_init (unsigned long int nthreads, size_t nlines,
+                 struct line *restrict dest)
+{
+  struct merge_node *merge_tree = xmalloc (2 * nthreads * sizeof *merge_tree);
+
+  struct merge_node *root_node = merge_tree;
+  root_node->lo = root_node->hi = root_node->end_lo = root_node->end_hi = NULL;
+  root_node->dest = NULL;
+  root_node->nlo = root_node->nhi = nlines;
+  root_node->parent = NULL;
+  root_node->level = MERGE_END;
+  root_node->queued = false;
+  pthread_mutex_init (&root_node->lock, NULL);
+
+  init_node (root_node, root_node + 1, dest, nthreads, nlines, false);
+  return merge_tree;
+}
+
+/* Destroy the merge tree. */
+static void
+merge_tree_destroy (struct merge_node *merge_tree)
+{
+  free (merge_tree);
+}
+
+/* Initialize a merge tree node. */
+
+struct merge_node *
+init_node (struct merge_node *parent, struct merge_node *node_pool,
+           struct line *restrict dest, unsigned long int nthreads,
+           size_t total_lines, bool lo_child)
+{
+  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
+  size_t nlo = nlines / 2;
+  size_t nhi = nlines - nlo;
+  struct line *lo = dest - total_lines;
+  struct line *hi = lo - nlo;
+  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
+
+  struct merge_node *node = node_pool++;
+  node->lo = node->end_lo = lo;
+  node->hi = node->end_hi = hi;
+  node->dest = parent_end;
+  node->nlo = nlo;
+  node->nhi = nhi;
+  node->parent = parent;
+  node->level = parent->level + 1;
+  node->queued = false;
+  pthread_mutex_init (&node->lock, NULL);
+
+  if (nthreads > 1)
+    {
+      unsigned long int lo_threads = nthreads / 2;
+      unsigned long int hi_threads = nthreads - lo_threads;
+      node->lo_child = node_pool;
+      node_pool = init_node (node, node_pool, lo, lo_threads,
+                             total_lines, true);
+      node->hi_child = node_pool;
+      node_pool = init_node (node, node_pool, hi, hi_threads,
+                             total_lines, false);
+    }
+  else
+    {
+      node->lo_child = NULL;
+      node->hi_child = NULL;
+    }
+  return node_pool;
+}
+
+
 /* Compare two merge nodes A and B for priority.  */

 static int
@@ -3378,10 +3457,8 @@ merge_loop (struct merge_node_queue *queue,
 }


-static void sortlines (struct line *restrict, struct line *restrict,
-                       unsigned long int, size_t,
-                       struct merge_node *, bool,
-                       struct merge_node_queue *,
+static void sortlines (struct line *restrict, unsigned long int, size_t,
+                       struct merge_node *, bool, struct merge_node_queue *,
                        FILE *, char const *);

 /* Thread arguments for sortlines_thread. */
@@ -3392,19 +3469,15 @@ struct thread_args
      the end of the array.  */
   struct line *lines;

-  /* Destination, i.e., the array of lines to store result into.  This
-     also points just past the end of the array.  */
-  struct line *dest;
-
   /* Number of threads to use.  If 0 or 1, sort single-threaded.  */
   unsigned long int nthreads;

   /* Number of lines in LINES and DEST.  */
   size_t const total_lines;

-  /* Parent node; it will merge this node's output to the output
-     of this node's sibling.  */
-  struct merge_node *const parent;
+  /* Merge node. Lines from this node and this node's sibling will merged
+     to this node's parent. */
+  struct merge_node *const node;

   /* True if this node is sorting the lower half of the parent's work.  */
   bool lo_child;
@@ -3425,9 +3498,9 @@ static void *
 sortlines_thread (void *data)
 {
   struct thread_args const *args = data;
-  sortlines (args->lines, args->dest, args->nthreads, args->total_lines,
-             args->parent, args->lo_child, args->queue,
-             args->tfp, args->output_temp);
+  sortlines (args->lines, args->nthreads, args->total_lines,
+             args->node, args->lo_child, args->queue, args->tfp,
+             args->output_temp);
   return NULL;
 }

@@ -3456,49 +3529,32 @@ sortlines_thread (void *data)
    have been merged. */

 static void
-sortlines (struct line *restrict lines, struct line *restrict dest,
-           unsigned long int nthreads, size_t total_lines,
-           struct merge_node *parent, bool lo_child,
-           struct merge_node_queue *queue,
-           FILE *tfp, char const *temp_output)
+sortlines (struct line *restrict lines, unsigned long int nthreads,
+           size_t total_lines, struct merge_node *node, bool lo_child,
+           struct merge_node_queue *queue, FILE *tfp, char const *temp_output)
 {
-  /* Create merge tree NODE. */
-  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
-  size_t nlo = nlines / 2;
-  size_t nhi = nlines - nlo;
-  struct line *lo = dest - total_lines;
-  struct line *hi = lo - nlo;
-  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
-
-  struct merge_node node;
-  node.lo = node.end_lo = lo;
-  node.hi = node.end_hi = hi;
-  node.dest = parent_end;
-  node.nlo = nlo;
-  node.nhi = nhi;
-  node.parent = parent;
-  node.level = parent->level + 1;
-  node.queued = false;
-  pthread_mutex_init (&node.lock, NULL);
+  size_t nlines = (lo_child)? node->parent->nlo : node->parent->nhi;

   /* Calculate thread arguments. */
   unsigned long int lo_threads = nthreads / 2;
   unsigned long int hi_threads = nthreads - lo_threads;
   pthread_t thread;
-  struct thread_args args = {lines, lo, lo_threads, total_lines, &node,
-                             true, queue, tfp, temp_output};
+  struct thread_args args = {lines, lo_threads, total_lines,
+                             node->lo_child, true, queue, tfp, temp_output};

   if (nthreads > 1 && SUBTHREAD_LINES_HEURISTIC <= nlines
       && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
     {
-      sortlines (lines - nlo, hi, hi_threads, total_lines, &node, false,
-                 queue, tfp, temp_output);
+      sortlines (lines - node->nlo, hi_threads, total_lines,
+                 node->hi_child, false, queue, tfp, temp_output);
       pthread_join (thread, NULL);
     }
   else
     {
       /* Nthreads = 1, this is a leaf NODE, or pthread_create failed.
          Sort with 1 thread. */
+      size_t nlo = node->nlo;
+      size_t nhi = node->nhi;
       struct line *temp = lines - total_lines;
       if (1 < nhi)
         sequential_sort (lines - nlo, nhi, temp - nlo / 2, false);
@@ -3506,16 +3562,16 @@ sortlines (struct line *restrict lines, struct
line *restrict dest,
         sequential_sort (lines, nlo, temp, false);

       /* Update merge NODE. No need to lock yet. */
-      node.lo = lines;
-      node.hi = lines - nlo;
-      node.end_lo = lines - nlo;
-      node.end_hi = lines - nlo - nhi;
+      node->lo = lines;
+      node->hi = lines - nlo;
+      node->end_lo = lines - nlo;
+      node->end_hi = lines - nlo - nhi;

-      queue_insert (queue, &node);
+      queue_insert (queue, node);
       merge_loop (queue, total_lines, tfp, temp_output);
     }

-  pthread_mutex_destroy (&node.lock);
+  pthread_mutex_destroy (&node->lock);
 }

 /* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3791,20 +3847,16 @@ sort (char * const *files, size_t nfiles, char
const *output_file,
             {
               struct merge_node_queue queue;
               queue_init (&queue, 2 * nthreads);
+              struct merge_node *merge_tree =
+                merge_tree_init (nthreads, buf.nlines, line);
+              struct merge_node *end_node = merge_tree;
+              struct merge_node *root_node = merge_tree + 1;

-              struct merge_node node;
-              node.lo = node.hi = node.end_lo = node.end_hi = NULL;
-              node.dest = NULL;
-              node.nlo = node.nhi = buf.nlines;
-              node.parent = NULL;
-              node.level = MERGE_END;
-              node.queued = false;
-              pthread_mutex_init (&node.lock, NULL);
-
-              sortlines (line, line, nthreads, buf.nlines, &node, true,
-                         &queue, tfp, temp_output);
+              sortlines (line, nthreads, buf.nlines, root_node,
+                         true, &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_mutex_destroy (&node.lock);
+              pthread_mutex_destroy (&root_node->lock);
+              merge_tree_destroy (merge_tree);
             }
           else
             write_unique (line - 1, tfp, temp_output);
-- 
1.7.1




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Fri, 10 Dec 2010 21:27:01 GMT) Full text and rfc822 format available.

Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Chen Guo <chen.guo.0625 <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: coreutils <at> gnu.org, bug-coreutils <at> gnu.org, Jim Meyering <jim <at> meyering.net>,
	DJ Lucas <dj <at> linuxfromscratch.org>
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the sort
	-u segfault)
Date: Fri, 10 Dec 2010 13:32:26 -0800
> +  size_t nlines = (lo_child)? node->parent->nlo : node->parent->nhi;

Small change... Remove unnecessary branch and redirection in sortlines:

  size_t nlines = node->nlo + node->nhi;


From c691813ecbfce60c207960fd8bd327cc7d99fe39 Mon Sep 17 00:00:00 2001
From: Chen Guo <chenguo4 <at> ucla.edu>
Date: Fri, 10 Dec 2010 13:13:36 -0800
Subject: [PATCH] sort: preallocate merge tree nodes to heap.

* src/sort.c: (merge_tree_init) New function. Allocates memory for
merge tree nodes.
(merge_tree_destory) New function.
(init_node) New function.
(sortlines) Refactor node creation code to init_node. Remove now
superfluous arguments. All callers changed.
(sort) Initialize/destory merge tree. Refactor root node creation
to merge_tree_init.
---
 src/sort.c |  170 +++++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 111 insertions(+), 59 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9cfc814..019b065 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -239,6 +239,8 @@ struct merge_node
   size_t nlo;                   /* Total Lines remaining from LO. */
   size_t nhi;                   /* Total lines remaining from HI. */
   struct merge_node *parent;    /* Parent node. */
+  struct merge_node *lo_child;  /* LO child node. */
+  struct merge_node *hi_child;  /* HI child node. */
   unsigned int level;           /* Level in merge tree. */
   bool queued;                  /* Node is already in heap. */
   pthread_mutex_t lock;         /* Lock for node operations. */
@@ -3137,6 +3139,83 @@ sequential_sort (struct line *restrict lines,
size_t nlines,
     }
 }

+struct merge_node * init_node (struct merge_node *, struct merge_node *,
+                               struct line *restrict, unsigned long int,
+                               size_t, bool);
+
+
+/* Initialize the merge tree. */
+static struct merge_node *
+merge_tree_init (unsigned long int nthreads, size_t nlines,
+                 struct line *restrict dest)
+{
+  struct merge_node *merge_tree = xmalloc (2 * nthreads * sizeof *merge_tree);
+
+  struct merge_node *root_node = merge_tree;
+  root_node->lo = root_node->hi = root_node->end_lo = root_node->end_hi = NULL;
+  root_node->dest = NULL;
+  root_node->nlo = root_node->nhi = nlines;
+  root_node->parent = NULL;
+  root_node->level = MERGE_END;
+  root_node->queued = false;
+  pthread_mutex_init (&root_node->lock, NULL);
+
+  init_node (root_node, root_node + 1, dest, nthreads, nlines, false);
+  return merge_tree;
+}
+
+/* Destroy the merge tree. */
+static void
+merge_tree_destroy (struct merge_node *merge_tree)
+{
+  free (merge_tree);
+}
+
+/* Initialize a merge tree node. */
+
+struct merge_node *
+init_node (struct merge_node *parent, struct merge_node *node_pool,
+           struct line *restrict dest, unsigned long int nthreads,
+           size_t total_lines, bool lo_child)
+{
+  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
+  size_t nlo = nlines / 2;
+  size_t nhi = nlines - nlo;
+  struct line *lo = dest - total_lines;
+  struct line *hi = lo - nlo;
+  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
+
+  struct merge_node *node = node_pool++;
+  node->lo = node->end_lo = lo;
+  node->hi = node->end_hi = hi;
+  node->dest = parent_end;
+  node->nlo = nlo;
+  node->nhi = nhi;
+  node->parent = parent;
+  node->level = parent->level + 1;
+  node->queued = false;
+  pthread_mutex_init (&node->lock, NULL);
+
+  if (nthreads > 1)
+    {
+      unsigned long int lo_threads = nthreads / 2;
+      unsigned long int hi_threads = nthreads - lo_threads;
+      node->lo_child = node_pool;
+      node_pool = init_node (node, node_pool, lo, lo_threads,
+                             total_lines, true);
+      node->hi_child = node_pool;
+      node_pool = init_node (node, node_pool, hi, hi_threads,
+                             total_lines, false);
+    }
+  else
+    {
+      node->lo_child = NULL;
+      node->hi_child = NULL;
+    }
+  return node_pool;
+}
+
+
 /* Compare two merge nodes A and B for priority.  */

 static int
@@ -3378,10 +3457,8 @@ merge_loop (struct merge_node_queue *queue,
 }


-static void sortlines (struct line *restrict, struct line *restrict,
-                       unsigned long int, size_t,
-                       struct merge_node *, bool,
-                       struct merge_node_queue *,
+static void sortlines (struct line *restrict, unsigned long int, size_t,
+                       struct merge_node *, bool, struct merge_node_queue *,
                        FILE *, char const *);

 /* Thread arguments for sortlines_thread. */
@@ -3392,19 +3469,15 @@ struct thread_args
      the end of the array.  */
   struct line *lines;

-  /* Destination, i.e., the array of lines to store result into.  This
-     also points just past the end of the array.  */
-  struct line *dest;
-
   /* Number of threads to use.  If 0 or 1, sort single-threaded.  */
   unsigned long int nthreads;

   /* Number of lines in LINES and DEST.  */
   size_t const total_lines;

-  /* Parent node; it will merge this node's output to the output
-     of this node's sibling.  */
-  struct merge_node *const parent;
+  /* Merge node. Lines from this node and this node's sibling will merged
+     to this node's parent. */
+  struct merge_node *const node;

   /* True if this node is sorting the lower half of the parent's work.  */
   bool lo_child;
@@ -3425,9 +3498,9 @@ static void *
 sortlines_thread (void *data)
 {
   struct thread_args const *args = data;
-  sortlines (args->lines, args->dest, args->nthreads, args->total_lines,
-             args->parent, args->lo_child, args->queue,
-             args->tfp, args->output_temp);
+  sortlines (args->lines, args->nthreads, args->total_lines,
+             args->node, args->lo_child, args->queue, args->tfp,
+             args->output_temp);
   return NULL;
 }

@@ -3456,49 +3529,32 @@ sortlines_thread (void *data)
    have been merged. */

 static void
-sortlines (struct line *restrict lines, struct line *restrict dest,
-           unsigned long int nthreads, size_t total_lines,
-           struct merge_node *parent, bool lo_child,
-           struct merge_node_queue *queue,
-           FILE *tfp, char const *temp_output)
+sortlines (struct line *restrict lines, unsigned long int nthreads,
+           size_t total_lines, struct merge_node *node, bool lo_child,
+           struct merge_node_queue *queue, FILE *tfp, char const *temp_output)
 {
-  /* Create merge tree NODE. */
-  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
-  size_t nlo = nlines / 2;
-  size_t nhi = nlines - nlo;
-  struct line *lo = dest - total_lines;
-  struct line *hi = lo - nlo;
-  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
-
-  struct merge_node node;
-  node.lo = node.end_lo = lo;
-  node.hi = node.end_hi = hi;
-  node.dest = parent_end;
-  node.nlo = nlo;
-  node.nhi = nhi;
-  node.parent = parent;
-  node.level = parent->level + 1;
-  node.queued = false;
-  pthread_mutex_init (&node.lock, NULL);
+  size_t nlines = node->nlo + node->nhi;

   /* Calculate thread arguments. */
   unsigned long int lo_threads = nthreads / 2;
   unsigned long int hi_threads = nthreads - lo_threads;
   pthread_t thread;
-  struct thread_args args = {lines, lo, lo_threads, total_lines, &node,
-                             true, queue, tfp, temp_output};
+  struct thread_args args = {lines, lo_threads, total_lines,
+                             node->lo_child, true, queue, tfp, temp_output};

   if (nthreads > 1 && SUBTHREAD_LINES_HEURISTIC <= nlines
       && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
     {
-      sortlines (lines - nlo, hi, hi_threads, total_lines, &node, false,
-                 queue, tfp, temp_output);
+      sortlines (lines - node->nlo, hi_threads, total_lines,
+                 node->hi_child, false, queue, tfp, temp_output);
       pthread_join (thread, NULL);
     }
   else
     {
       /* Nthreads = 1, this is a leaf NODE, or pthread_create failed.
          Sort with 1 thread. */
+      size_t nlo = node->nlo;
+      size_t nhi = node->nhi;
       struct line *temp = lines - total_lines;
       if (1 < nhi)
         sequential_sort (lines - nlo, nhi, temp - nlo / 2, false);
@@ -3506,16 +3562,16 @@ sortlines (struct line *restrict lines, struct
line *restrict dest,
         sequential_sort (lines, nlo, temp, false);

       /* Update merge NODE. No need to lock yet. */
-      node.lo = lines;
-      node.hi = lines - nlo;
-      node.end_lo = lines - nlo;
-      node.end_hi = lines - nlo - nhi;
+      node->lo = lines;
+      node->hi = lines - nlo;
+      node->end_lo = lines - nlo;
+      node->end_hi = lines - nlo - nhi;

-      queue_insert (queue, &node);
+      queue_insert (queue, node);
       merge_loop (queue, total_lines, tfp, temp_output);
     }

-  pthread_mutex_destroy (&node.lock);
+  pthread_mutex_destroy (&node->lock);
 }

 /* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3791,20 +3847,16 @@ sort (char * const *files, size_t nfiles, char
const *output_file,
             {
               struct merge_node_queue queue;
               queue_init (&queue, 2 * nthreads);
+              struct merge_node *merge_tree =
+                merge_tree_init (nthreads, buf.nlines, line);
+              struct merge_node *end_node = merge_tree;
+              struct merge_node *root_node = merge_tree + 1;

-              struct merge_node node;
-              node.lo = node.hi = node.end_lo = node.end_hi = NULL;
-              node.dest = NULL;
-              node.nlo = node.nhi = buf.nlines;
-              node.parent = NULL;
-              node.level = MERGE_END;
-              node.queued = false;
-              pthread_mutex_init (&node.lock, NULL);
-
-              sortlines (line, line, nthreads, buf.nlines, &node, true,
-                         &queue, tfp, temp_output);
+              sortlines (line, nthreads, buf.nlines, root_node,
+                         true, &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_mutex_destroy (&node.lock);
+              pthread_mutex_destroy (&root_node->lock);
+              merge_tree_destroy (merge_tree);
             }
           else
             write_unique (line - 1, tfp, temp_output);
-- 
1.7.1




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Sat, 11 Dec 2010 08:30:03 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Chen Guo <chen.guo.0625 <at> gmail.com>
Cc: coreutils <at> gnu.org, bug-coreutils <at> gnu.org, Jim Meyering <jim <at> meyering.net>,
	DJ Lucas <dj <at> linuxfromscratch.org>
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Sat, 11 Dec 2010 00:35:20 -0800
Thanks, Chen, that was much nicer than what I was writing.
I did some minor cleanups, mostly to do with catching an
unlikely integer overflow that would have made 'sort' crash badly,
and pushed the following set of patches.

From 780831a8602d9cdc22e7dcf44804e9c7183dd944 Mon Sep 17 00:00:00 2001
From: Chen Guo <chenguo4 <at> ucla.edu>
Date: Mon, 6 Dec 2010 00:15:42 -0800
Subject: [PATCH 1/4] sort: use mutexes, not spinlocks (avoid busy loop on blocked output)

Running a command like this on a multi-core system
  sort < big-file | less
would peg all processors at near 100% utilization.
* src/sort.c: (struct merge_node) Change member lock to mutex.
All uses changed.
* tests/Makefile.am (XFAIL_TESTS): Remove definition, now that
this test passes once again.  I.e., the sort-spinlock-abuse test
no longer fails.
* NEWS (Bug reports): Mention this.
Reported by DJ Lucas in http://debbugs.gnu.org/7489.
---
 NEWS              |    5 +++++
 src/sort.c        |   14 +++++++-------
 tests/Makefile.am |    3 ---
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/NEWS b/NEWS
index c3110a3..9f55cbb 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,11 @@ GNU coreutils NEWS                                    -*- outline -*-
   sort -u with at least two threads could attempt to read through a
   corrupted pointer. [bug introduced in coreutils-8.6]
 
+  sort with at least two threads and with blocked output would busy-loop
+  (spinlock) all threads, often using 100% of available CPU cycles to
+  do no work.  I.e., "sort < big-file | less" could waste a lot of power.
+  [bug introduced in coreutils-8.6]
+
 ** New features
 
   split accepts the --number option to generate a specific number of files.
diff --git a/src/sort.c b/src/sort.c
index a4c2cbf..9cfc814 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -241,7 +241,7 @@ struct merge_node
   struct merge_node *parent;    /* Parent node. */
   unsigned int level;           /* Level in merge tree. */
   bool queued;                  /* Node is already in heap. */
-  pthread_spinlock_t lock;      /* Lock for node operations. */
+  pthread_mutex_t lock;         /* Lock for node operations. */
 };
 
 /* Priority queue of merge nodes. */
@@ -3157,7 +3157,7 @@ compare_nodes (void const *a, void const *b)
 static inline void
 lock_node (struct merge_node *node)
 {
-  pthread_spin_lock (&node->lock);
+  pthread_mutex_lock (&node->lock);
 }
 
 /* Unlock a merge tree NODE. */
@@ -3165,7 +3165,7 @@ lock_node (struct merge_node *node)
 static inline void
 unlock_node (struct merge_node *node)
 {
-  pthread_spin_unlock (&node->lock);
+  pthread_mutex_unlock (&node->lock);
 }
 
 /* Destroy merge QUEUE. */
@@ -3479,7 +3479,7 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
   node.parent = parent;
   node.level = parent->level + 1;
   node.queued = false;
-  pthread_spin_init (&node.lock, PTHREAD_PROCESS_PRIVATE);
+  pthread_mutex_init (&node.lock, NULL);
 
   /* Calculate thread arguments. */
   unsigned long int lo_threads = nthreads / 2;
@@ -3515,7 +3515,7 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
       merge_loop (queue, total_lines, tfp, temp_output);
     }
 
-  pthread_spin_destroy (&node.lock);
+  pthread_mutex_destroy (&node.lock);
 }
 
 /* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3799,12 +3799,12 @@ sort (char * const *files, size_t nfiles, char const *output_file,
               node.parent = NULL;
               node.level = MERGE_END;
               node.queued = false;
-              pthread_spin_init (&node.lock, PTHREAD_PROCESS_PRIVATE);
+              pthread_mutex_init (&node.lock, NULL);
 
               sortlines (line, line, nthreads, buf.nlines, &node, true,
                          &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_spin_destroy (&node.lock);
+              pthread_mutex_destroy (&node.lock);
             }
           else
             write_unique (line - 1, tfp, temp_output);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index d52f677..b573061 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -656,7 +656,4 @@ pr_data =					\
   pr/ttb3-FF					\
   pr/w72l24f-ll
 
-XFAIL_TESTS =					\
-  misc/sort-spinlock-abuse
-
 include $(srcdir)/check.mk
-- 
1.7.2


From bcf9043b1f3983d099672047b36ad0a371af169d Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Fri, 10 Dec 2010 20:52:04 -0800
Subject: [PATCH 2/4] sort: comment fix

* src/sort.c: Comment fix re spin locks.
---
 src/sort.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9cfc814..36e3b19 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3149,10 +3149,7 @@ compare_nodes (void const *a, void const *b)
   return nodea->level < nodeb->level;
 }
 
-/* Lock a merge tree NODE.
-   Spin locks were seen to perform better than mutexes when the number
-   of threads is limited to the number of processors, assuming 'sort'
-   never waits when writing to stdout.  */
+/* Lock a merge tree NODE.  */
 
 static inline void
 lock_node (struct merge_node *node)
@@ -4567,8 +4564,6 @@ main (int argc, char **argv)
     }
   else
     {
-      /* If NTHREADS > number of cores on the machine, spinlocking
-         could be wasteful.  */
       unsigned long int np2 = num_processors (NPROC_CURRENT_OVERRIDABLE);
       if (!nthreads || nthreads > np2)
         nthreads = np2;
-- 
1.7.2


From a1f8177972fb3f864847dc45237ad7d4d6a7f395 Mon Sep 17 00:00:00 2001
From: Chen Guo <chenguo4 <at> ucla.edu>
Date: Fri, 10 Dec 2010 13:13:36 -0800
Subject: [PATCH 3/4] sort: preallocate merge tree nodes to heap.

* src/sort.c: (merge_tree_init) New function. Allocates memory for
merge tree nodes.
(merge_tree_destory) New function.
(init_node) New function.
(sortlines) Refactor node creation code to init_node. Remove now
superfluous arguments. All callers changed.
(sort) Initialize/destory merge tree. Refactor root node creation
to merge_tree_init.
---
 src/sort.c |  170 +++++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 111 insertions(+), 59 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 36e3b19..b724f3d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -239,6 +239,8 @@ struct merge_node
   size_t nlo;                   /* Total Lines remaining from LO. */
   size_t nhi;                   /* Total lines remaining from HI. */
   struct merge_node *parent;    /* Parent node. */
+  struct merge_node *lo_child;  /* LO child node. */
+  struct merge_node *hi_child;  /* HI child node. */
   unsigned int level;           /* Level in merge tree. */
   bool queued;                  /* Node is already in heap. */
   pthread_mutex_t lock;         /* Lock for node operations. */
@@ -3137,6 +3139,83 @@ sequential_sort (struct line *restrict lines, size_t nlines,
     }
 }
 
+struct merge_node * init_node (struct merge_node *, struct merge_node *,
+                               struct line *restrict, unsigned long int,
+                               size_t, bool);
+
+
+/* Initialize the merge tree. */
+static struct merge_node *
+merge_tree_init (unsigned long int nthreads, size_t nlines,
+                 struct line *restrict dest)
+{
+  struct merge_node *merge_tree = xmalloc (2 * nthreads * sizeof *merge_tree);
+
+  struct merge_node *root_node = merge_tree;
+  root_node->lo = root_node->hi = root_node->end_lo = root_node->end_hi = NULL;
+  root_node->dest = NULL;
+  root_node->nlo = root_node->nhi = nlines;
+  root_node->parent = NULL;
+  root_node->level = MERGE_END;
+  root_node->queued = false;
+  pthread_mutex_init (&root_node->lock, NULL);
+
+  init_node (root_node, root_node + 1, dest, nthreads, nlines, false);
+  return merge_tree;
+}
+
+/* Destroy the merge tree. */
+static void
+merge_tree_destroy (struct merge_node *merge_tree)
+{
+  free (merge_tree);
+}
+
+/* Initialize a merge tree node. */
+
+struct merge_node *
+init_node (struct merge_node *parent, struct merge_node *node_pool,
+           struct line *restrict dest, unsigned long int nthreads,
+           size_t total_lines, bool lo_child)
+{
+  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
+  size_t nlo = nlines / 2;
+  size_t nhi = nlines - nlo;
+  struct line *lo = dest - total_lines;
+  struct line *hi = lo - nlo;
+  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
+
+  struct merge_node *node = node_pool++;
+  node->lo = node->end_lo = lo;
+  node->hi = node->end_hi = hi;
+  node->dest = parent_end;
+  node->nlo = nlo;
+  node->nhi = nhi;
+  node->parent = parent;
+  node->level = parent->level + 1;
+  node->queued = false;
+  pthread_mutex_init (&node->lock, NULL);
+
+  if (nthreads > 1)
+    {
+      unsigned long int lo_threads = nthreads / 2;
+      unsigned long int hi_threads = nthreads - lo_threads;
+      node->lo_child = node_pool;
+      node_pool = init_node (node, node_pool, lo, lo_threads,
+                             total_lines, true);
+      node->hi_child = node_pool;
+      node_pool = init_node (node, node_pool, hi, hi_threads,
+                             total_lines, false);
+    }
+  else
+    {
+      node->lo_child = NULL;
+      node->hi_child = NULL;
+    }
+  return node_pool;
+}
+
+
 /* Compare two merge nodes A and B for priority.  */
 
 static int
@@ -3375,10 +3454,8 @@ merge_loop (struct merge_node_queue *queue,
 }
 
 
-static void sortlines (struct line *restrict, struct line *restrict,
-                       unsigned long int, size_t,
-                       struct merge_node *, bool,
-                       struct merge_node_queue *,
+static void sortlines (struct line *restrict, unsigned long int, size_t,
+                       struct merge_node *, bool, struct merge_node_queue *,
                        FILE *, char const *);
 
 /* Thread arguments for sortlines_thread. */
@@ -3389,19 +3466,15 @@ struct thread_args
      the end of the array.  */
   struct line *lines;
 
-  /* Destination, i.e., the array of lines to store result into.  This
-     also points just past the end of the array.  */
-  struct line *dest;
-
   /* Number of threads to use.  If 0 or 1, sort single-threaded.  */
   unsigned long int nthreads;
 
   /* Number of lines in LINES and DEST.  */
   size_t const total_lines;
 
-  /* Parent node; it will merge this node's output to the output
-     of this node's sibling.  */
-  struct merge_node *const parent;
+  /* Merge node. Lines from this node and this node's sibling will merged
+     to this node's parent. */
+  struct merge_node *const node;
 
   /* True if this node is sorting the lower half of the parent's work.  */
   bool lo_child;
@@ -3422,9 +3495,9 @@ static void *
 sortlines_thread (void *data)
 {
   struct thread_args const *args = data;
-  sortlines (args->lines, args->dest, args->nthreads, args->total_lines,
-             args->parent, args->lo_child, args->queue,
-             args->tfp, args->output_temp);
+  sortlines (args->lines, args->nthreads, args->total_lines,
+             args->node, args->lo_child, args->queue, args->tfp,
+             args->output_temp);
   return NULL;
 }
 
@@ -3453,49 +3526,32 @@ sortlines_thread (void *data)
    have been merged. */
 
 static void
-sortlines (struct line *restrict lines, struct line *restrict dest,
-           unsigned long int nthreads, size_t total_lines,
-           struct merge_node *parent, bool lo_child,
-           struct merge_node_queue *queue,
-           FILE *tfp, char const *temp_output)
+sortlines (struct line *restrict lines, unsigned long int nthreads,
+           size_t total_lines, struct merge_node *node, bool lo_child,
+           struct merge_node_queue *queue, FILE *tfp, char const *temp_output)
 {
-  /* Create merge tree NODE. */
-  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
-  size_t nlo = nlines / 2;
-  size_t nhi = nlines - nlo;
-  struct line *lo = dest - total_lines;
-  struct line *hi = lo - nlo;
-  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
-
-  struct merge_node node;
-  node.lo = node.end_lo = lo;
-  node.hi = node.end_hi = hi;
-  node.dest = parent_end;
-  node.nlo = nlo;
-  node.nhi = nhi;
-  node.parent = parent;
-  node.level = parent->level + 1;
-  node.queued = false;
-  pthread_mutex_init (&node.lock, NULL);
+  size_t nlines = node->nlo + node->nhi;
 
   /* Calculate thread arguments. */
   unsigned long int lo_threads = nthreads / 2;
   unsigned long int hi_threads = nthreads - lo_threads;
   pthread_t thread;
-  struct thread_args args = {lines, lo, lo_threads, total_lines, &node,
-                             true, queue, tfp, temp_output};
+  struct thread_args args = {lines, lo_threads, total_lines,
+                             node->lo_child, true, queue, tfp, temp_output};
 
   if (nthreads > 1 && SUBTHREAD_LINES_HEURISTIC <= nlines
       && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
     {
-      sortlines (lines - nlo, hi, hi_threads, total_lines, &node, false,
-                 queue, tfp, temp_output);
+      sortlines (lines - node->nlo, hi_threads, total_lines,
+                 node->hi_child, false, queue, tfp, temp_output);
       pthread_join (thread, NULL);
     }
   else
     {
       /* Nthreads = 1, this is a leaf NODE, or pthread_create failed.
          Sort with 1 thread. */
+      size_t nlo = node->nlo;
+      size_t nhi = node->nhi;
       struct line *temp = lines - total_lines;
       if (1 < nhi)
         sequential_sort (lines - nlo, nhi, temp - nlo / 2, false);
@@ -3503,16 +3559,16 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
         sequential_sort (lines, nlo, temp, false);
 
       /* Update merge NODE. No need to lock yet. */
-      node.lo = lines;
-      node.hi = lines - nlo;
-      node.end_lo = lines - nlo;
-      node.end_hi = lines - nlo - nhi;
+      node->lo = lines;
+      node->hi = lines - nlo;
+      node->end_lo = lines - nlo;
+      node->end_hi = lines - nlo - nhi;
 
-      queue_insert (queue, &node);
+      queue_insert (queue, node);
       merge_loop (queue, total_lines, tfp, temp_output);
     }
 
-  pthread_mutex_destroy (&node.lock);
+  pthread_mutex_destroy (&node->lock);
 }
 
 /* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3788,20 +3844,16 @@ sort (char * const *files, size_t nfiles, char const *output_file,
             {
               struct merge_node_queue queue;
               queue_init (&queue, 2 * nthreads);
+              struct merge_node *merge_tree =
+                merge_tree_init (nthreads, buf.nlines, line);
+              struct merge_node *end_node = merge_tree;
+              struct merge_node *root_node = merge_tree + 1;
 
-              struct merge_node node;
-              node.lo = node.hi = node.end_lo = node.end_hi = NULL;
-              node.dest = NULL;
-              node.nlo = node.nhi = buf.nlines;
-              node.parent = NULL;
-              node.level = MERGE_END;
-              node.queued = false;
-              pthread_mutex_init (&node.lock, NULL);
-
-              sortlines (line, line, nthreads, buf.nlines, &node, true,
-                         &queue, tfp, temp_output);
+              sortlines (line, nthreads, buf.nlines, root_node,
+                         true, &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_mutex_destroy (&node.lock);
+              pthread_mutex_destroy (&root_node->lock);
+              merge_tree_destroy (merge_tree);
             }
           else
             write_unique (line - 1, tfp, temp_output);
-- 
1.7.2


From 8413953717970bc1cf33c52a5882489717304624 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Sat, 11 Dec 2010 00:27:05 -0800
Subject: [PATCH 4/4] sort: integer overflow checks in thread counts, etc.

* src/sort.c (specify_nthreads, merge_tree_init, init_node):
(queue_init, sortlines, struct thread_args, sort, main):
Use size_t, not unsigned long int, for thread counts, since thread
counts are now used to compute sizes.
(specify_nthreads): Check for size_t overflow.
(merge_tree_init, sort): Shorten name of local variable, for
readability.
(merge_tree_init): Move constants next to each other in product,
so that the constant folding is easier to see.
(init_node): Now static.  Add 'restrict' only where it might
be helpful for compiler optimization.
(queue_init): 2nd arg is now nthreads, not "reserve", which is
a bit harder to follow.  All uses changed.
(struct thread_args): Rename lo_child to is_lo_child, so that
it's obvious to the reader when we're talking about this boolean
as opposed to the new lo_child member of the other structure.
All uses changed.
(sort): Remove unused local variable end_node.
(main): Don't allow large thread counts to cause undefined behavior
later, due to integer overflow.
---
 src/sort.c |  115 +++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 64 insertions(+), 51 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index b724f3d..2c0f852 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1379,15 +1379,17 @@ specify_sort_size (int oi, char c, char const *s)
 }
 
 /* Specify the number of threads to spawn during internal sort.  */
-static unsigned long int
+static size_t
 specify_nthreads (int oi, char c, char const *s)
 {
   unsigned long int nthreads;
   enum strtol_error e = xstrtoul (s, NULL, 10, &nthreads, "");
   if (e == LONGINT_OVERFLOW)
-    return ULONG_MAX;
+    return SIZE_MAX;
   if (e != LONGINT_OK)
     xstrtol_fatal (e, oi, c, long_options, s);
+  if (SIZE_MAX < nthreads)
+    nthreads = SIZE_MAX;
   if (nthreads == 0)
     error (SORT_FAILURE, 0, _("number in parallel must be nonzero"));
   return nthreads;
@@ -3139,28 +3141,28 @@ sequential_sort (struct line *restrict lines, size_t nlines,
     }
 }
 
-struct merge_node * init_node (struct merge_node *, struct merge_node *,
-                               struct line *restrict, unsigned long int,
-                               size_t, bool);
+static struct merge_node *init_node (struct merge_node *restrict,
+                                     struct merge_node *restrict,
+                                     struct line *, size_t, size_t, bool);
 
 
-/* Initialize the merge tree. */
+/* Create and return a merge tree for NTHREADS threads, sorting NLINES
+   lines, with destination DEST.  */
 static struct merge_node *
-merge_tree_init (unsigned long int nthreads, size_t nlines,
-                 struct line *restrict dest)
+merge_tree_init (size_t nthreads, size_t nlines, struct line *dest)
 {
-  struct merge_node *merge_tree = xmalloc (2 * nthreads * sizeof *merge_tree);
-
-  struct merge_node *root_node = merge_tree;
-  root_node->lo = root_node->hi = root_node->end_lo = root_node->end_hi = NULL;
-  root_node->dest = NULL;
-  root_node->nlo = root_node->nhi = nlines;
-  root_node->parent = NULL;
-  root_node->level = MERGE_END;
-  root_node->queued = false;
-  pthread_mutex_init (&root_node->lock, NULL);
-
-  init_node (root_node, root_node + 1, dest, nthreads, nlines, false);
+  struct merge_node *merge_tree = xmalloc (2 * sizeof *merge_tree * nthreads);
+
+  struct merge_node *root = merge_tree;
+  root->lo = root->hi = root->end_lo = root->end_hi = NULL;
+  root->dest = NULL;
+  root->nlo = root->nhi = nlines;
+  root->parent = NULL;
+  root->level = MERGE_END;
+  root->queued = false;
+  pthread_mutex_init (&root->lock, NULL);
+
+  init_node (root, root + 1, dest, nthreads, nlines, false);
   return merge_tree;
 }
 
@@ -3171,19 +3173,25 @@ merge_tree_destroy (struct merge_node *merge_tree)
   free (merge_tree);
 }
 
-/* Initialize a merge tree node. */
+/* Initialize a merge tree node and its descendants.  The node's
+   parent is PARENT.  The node and its descendants are taken from the
+   array of nodes NODE_POOL.  Their destination starts at DEST; they
+   will consume NTHREADS threads.  The total number of sort lines is
+   TOTAL_LINES.  IS_LO_CHILD is true if the node is the low child of
+   its parent.  */
 
-struct merge_node *
-init_node (struct merge_node *parent, struct merge_node *node_pool,
-           struct line *restrict dest, unsigned long int nthreads,
-           size_t total_lines, bool lo_child)
+static struct merge_node *
+init_node (struct merge_node *restrict parent,
+           struct merge_node *restrict node_pool,
+           struct line *dest, size_t nthreads,
+           size_t total_lines, bool is_lo_child)
 {
-  size_t nlines = (lo_child)? parent->nlo : parent->nhi;
+  size_t nlines = (is_lo_child ? parent->nlo : parent->nhi);
   size_t nlo = nlines / 2;
   size_t nhi = nlines - nlo;
   struct line *lo = dest - total_lines;
   struct line *hi = lo - nlo;
-  struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
+  struct line **parent_end = (is_lo_child ? &parent->end_lo : &parent->end_hi);
 
   struct merge_node *node = node_pool++;
   node->lo = node->end_lo = lo;
@@ -3198,8 +3206,8 @@ init_node (struct merge_node *parent, struct merge_node *node_pool,
 
   if (nthreads > 1)
     {
-      unsigned long int lo_threads = nthreads / 2;
-      unsigned long int hi_threads = nthreads - lo_threads;
+      size_t lo_threads = nthreads / 2;
+      size_t hi_threads = nthreads - lo_threads;
       node->lo_child = node_pool;
       node_pool = init_node (node, node_pool, lo, lo_threads,
                              total_lines, true);
@@ -3254,15 +3262,16 @@ queue_destroy (struct merge_node_queue *queue)
   pthread_mutex_destroy (&queue->mutex);
 }
 
-/* Initialize merge QUEUE, allocating space for a maximum of RESERVE nodes.
-   Though it's highly unlikely all nodes are in the heap at the same time,
-   RESERVE should accommodate all of them. Counting a NULL dummy head for the
-   heap, RESERVE should be 2 * NTHREADS. */
+/* Initialize merge QUEUE, allocating space suitable for a maximum of
+   NTHREADS threads.  */
 
 static void
-queue_init (struct merge_node_queue *queue, size_t reserve)
+queue_init (struct merge_node_queue *queue, size_t nthreads)
 {
-  queue->priority_queue = heap_alloc (compare_nodes, reserve);
+  /* Though it's highly unlikely all nodes are in the heap at the same
+     time, the heap should accommodate all of them.  Counting a NULL
+     dummy head for the heap, reserve 2 * NTHREADS nodes.  */
+  queue->priority_queue = heap_alloc (compare_nodes, 2 * nthreads);
   pthread_mutex_init (&queue->mutex, NULL);
   pthread_cond_init (&queue->cond, NULL);
 }
@@ -3454,7 +3463,7 @@ merge_loop (struct merge_node_queue *queue,
 }
 
 
-static void sortlines (struct line *restrict, unsigned long int, size_t,
+static void sortlines (struct line *restrict, size_t, size_t,
                        struct merge_node *, bool, struct merge_node_queue *,
                        FILE *, char const *);
 
@@ -3467,7 +3476,7 @@ struct thread_args
   struct line *lines;
 
   /* Number of threads to use.  If 0 or 1, sort single-threaded.  */
-  unsigned long int nthreads;
+  size_t nthreads;
 
   /* Number of lines in LINES and DEST.  */
   size_t const total_lines;
@@ -3477,7 +3486,7 @@ struct thread_args
   struct merge_node *const node;
 
   /* True if this node is sorting the lower half of the parent's work.  */
-  bool lo_child;
+  bool is_lo_child;
 
   /* The priority queue controlling available work for the entire
      internal sort.  */
@@ -3496,7 +3505,7 @@ sortlines_thread (void *data)
 {
   struct thread_args const *args = data;
   sortlines (args->lines, args->nthreads, args->total_lines,
-             args->node, args->lo_child, args->queue, args->tfp,
+             args->node, args->is_lo_child, args->queue, args->tfp,
              args->output_temp);
   return NULL;
 }
@@ -3526,15 +3535,15 @@ sortlines_thread (void *data)
    have been merged. */
 
 static void
-sortlines (struct line *restrict lines, unsigned long int nthreads,
-           size_t total_lines, struct merge_node *node, bool lo_child,
+sortlines (struct line *restrict lines, size_t nthreads,
+           size_t total_lines, struct merge_node *node, bool is_lo_child,
            struct merge_node_queue *queue, FILE *tfp, char const *temp_output)
 {
   size_t nlines = node->nlo + node->nhi;
 
   /* Calculate thread arguments. */
-  unsigned long int lo_threads = nthreads / 2;
-  unsigned long int hi_threads = nthreads - lo_threads;
+  size_t lo_threads = nthreads / 2;
+  size_t hi_threads = nthreads - lo_threads;
   pthread_t thread;
   struct thread_args args = {lines, lo_threads, total_lines,
                              node->lo_child, true, queue, tfp, temp_output};
@@ -3774,7 +3783,7 @@ merge (struct sortfile *files, size_t ntemps, size_t nfiles,
 
 static void
 sort (char * const *files, size_t nfiles, char const *output_file,
-      unsigned long int nthreads)
+      size_t nthreads)
 {
   struct buffer buf;
   size_t ntemps = 0;
@@ -3793,7 +3802,7 @@ sort (char * const *files, size_t nfiles, char const *output_file,
       if (nthreads > 1)
         {
           /* Get log P. */
-          unsigned long int tmp = 1;
+          size_t tmp = 1;
           size_t mult = 1;
           while (tmp < nthreads)
             {
@@ -3843,16 +3852,15 @@ sort (char * const *files, size_t nfiles, char const *output_file,
           if (1 < buf.nlines)
             {
               struct merge_node_queue queue;
-              queue_init (&queue, 2 * nthreads);
+              queue_init (&queue, nthreads);
               struct merge_node *merge_tree =
                 merge_tree_init (nthreads, buf.nlines, line);
-              struct merge_node *end_node = merge_tree;
-              struct merge_node *root_node = merge_tree + 1;
+              struct merge_node *root = merge_tree + 1;
 
-              sortlines (line, nthreads, buf.nlines, root_node,
+              sortlines (line, nthreads, buf.nlines, root,
                          true, &queue, tfp, temp_output);
               queue_destroy (&queue);
-              pthread_mutex_destroy (&root_node->lock);
+              pthread_mutex_destroy (&root->lock);
               merge_tree_destroy (merge_tree);
             }
           else
@@ -4076,7 +4084,7 @@ main (int argc, char **argv)
   bool mergeonly = false;
   char *random_source = NULL;
   bool need_random = false;
-  unsigned long int nthreads = 0;
+  size_t nthreads = 0;
   size_t nfiles = 0;
   bool posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL);
   bool obsolete_usage = (posix2_version () < 200112);
@@ -4620,6 +4628,11 @@ main (int argc, char **argv)
       if (!nthreads || nthreads > np2)
         nthreads = np2;
 
+      /* Avoid integer overflow later.  */
+      size_t nthreads_max = SIZE_MAX / (2 * sizeof (struct merge_node));
+      if (nthreads_max < nthreads)
+        nthreads = nthreads_max;
+
       sort (files, nfiles, outfile, nthreads);
     }
 
-- 
1.7.2





Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Sat, 11 Dec 2010 11:04:01 GMT) Full text and rfc822 format available.

Message #23 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the sort
	-u segfault)
Date: Sat, 11 Dec 2010 12:09:17 +0100
Paul Eggert wrote:
> Thanks, Chen, that was much nicer than what I was writing.
> I did some minor cleanups, mostly to do with catching an
> unlikely integer overflow that would have made 'sort' crash badly,
> and pushed the following set of patches.

Thanks for helping, but...

Chen's log message would have been appropriate if that patch had been
rebased to apply after my stack->heap bug fix.  However, as a replacement
for my fix, the description is lacking, since it says nothing about fixing
the hard-to-reproduce (and harder-to-diagnose) segfault-inducing bug.
Plus, the change set includes no NEWS or test suite addition.

With each bug fix it is best to describe
or at least mention the bug in the commit log.
Also, there should be a NEWS entry and, preferably,
a test or two to exercise the bug.

Here at least is the NEWS addition and log from what I posted Thursday.
I've also fixed a few syntax nits separately:


From 9a9d69e9e463ebfa67e90ecc061305532a91543e Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> redhat.com>
Date: Sat, 11 Dec 2010 11:29:38 +0100
Subject: [PATCH 1/2] sort: syntax cleanup

* src/sort.c (xfopen, debug_key, sortlines, sort, main): Adjust
formatting: fix misplaced braces, use consistent spacing,
split a 2-stmt line.
---
 src/sort.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 2c0f852..1de8115 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -939,7 +939,7 @@ stream_open (char const *file, char const *how)

 static FILE *
 xfopen (char const *file, char const *how)
- {
+{
   FILE *fp = stream_open (file, how);
   if (!fp)
     die (_("open failed"), file);
@@ -2207,7 +2207,8 @@ debug_key (struct line const *line, struct keyfield const *key)

       if (key->skipsblanks || key->month || key_numeric (key))
         {
-          char saved = *lim; *lim = '\0';
+          char saved = *lim;
+          *lim = '\0';

           while (blanks[to_uchar (*beg)])
             beg++;
@@ -3782,7 +3783,7 @@ merge (struct sortfile *files, size_t ntemps, size_t nfiles,
 /* Sort NFILES FILES onto OUTPUT_FILE.  Use at most NTHREADS threads.  */

 static void
-sort (char * const *files, size_t nfiles, char const *output_file,
+sort (char *const *files, size_t nfiles, char const *output_file,
       size_t nthreads)
 {
   struct buffer buf;
@@ -4498,7 +4499,7 @@ main (int argc, char **argv)
           files = tok.tok;
           nfiles = tok.n_tok;
           for (i = 0; i < nfiles; i++)
-          {
+            {
               if (STREQ (files[i], "-"))
                 error (SORT_FAILURE, 0, _("when reading file names from stdin, "
                                           "no file name of %s allowed"),
@@ -4513,7 +4514,7 @@ main (int argc, char **argv)
                          _("%s:%lu: invalid zero-length file name"),
                          quotearg_colon (files_from), file_number);
                 }
-          }
+            }
         }
       else
         error (SORT_FAILURE, 0, _("no input from %s"),
--
1.7.3.3


From ad61335bf832937dd95739c70405db9b61ffda37 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> redhat.com>
Date: Sat, 11 Dec 2010 11:38:21 +0100
Subject: [PATCH 2/2] sort: avoid segfault when using two or more threads

This change does not fix the actual bug.  That was done by commit
c9db0ac6, "sort: preallocate merge tree nodes to heap".  The fix
was to store each "node" structure on the heap, not on the stack.
Otherwise, a node from one thread's stack could be used in another
thread after the first thread had expired (via pthread_join).
This bug was very hard to trigger when using spinlocks, but
easier once we began using mutexes.
* NEWS (Bug fixes): Mention it.
For details, see http://debbugs.gnu.org/7597.
---
 NEWS |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/NEWS b/NEWS
index 9f55cbb..b0a11b1 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,9 @@ GNU coreutils NEWS                                    -*- outline -*-
   do no work.  I.e., "sort < big-file | less" could waste a lot of power.
   [bug introduced in coreutils-8.6]

+  sort with at least two threads no longer segfaults due to use of pointers
+  into the stack of an expired thread. [bug introduced in coreutils-8.6]
+
 ** New features

   split accepts the --number option to generate a specific number of files.
--
1.7.3.3




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Sat, 11 Dec 2010 20:07:01 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Sat, 11 Dec 2010 12:00:48 -0800
Sorry for botching the NEWS and the change log.  To help
make amends, how about if I add a test case for that?
I'm thinking of the 2nd test case in
<http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00043.html>,
namely this one:

gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
  --parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done

Or if you have a better one in mind, please let me know.

There are also some test cases I need to add for the
(unrelated) sort-compression bug, which is next on my
list of coreutils bugs to look at.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Sun, 12 Dec 2010 15:36:02 GMT) Full text and rfc822 format available.

Message #29 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the sort
	-u segfault)
Date: Sun, 12 Dec 2010 16:41:07 +0100
Paul Eggert wrote:
> Sorry for botching the NEWS and the change log.  To help
> make amends, how about if I add a test case for that?

That would be welcome.  Thanks.

> I'm thinking of the 2nd test case in
> <http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00043.html>,
> namely this one:
>
> gensort -a 10000 > gensort-10k
> for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
>   --parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done

That sounds good, assuming it triggers the bug reliably for you.
I was hoping to find a way to reproduce it without relying on gensort,
but won't object if you want to do that.

I ran the above a few times on a 6-core i7 970 and it would usually
fail only 2 or 3 times out of 2000.  For one trial, it failed
only once, and that was on the 1932nd iteration.

Interestingly, when I run enough of those 2-process jobs in parallel to keep
all nominal 12 processors busy, I get 80-100 failures in 10,000 trials.

cat <<\EOF > sort-test
#!/bin/bash
exp_output=$1
t=$(mktemp) || exit 2
src/sort --parallel=2 -S 100k in > $t || exit 3
cmp $t $exp_output || exit 4
rm -f $t
exit 0
EOF
chmod a+x sort-test

i.e., by running that little script via GNU parallel:

    export TMPDIR=/dev/shm/z; mkdir $TMPDIR
    gensort -a 1000 in; sort in > exp; seq 10000 \
      | parallel --eta -j $(($(nproc)/2)) ./sort-test exp

This shows how many failed of the 10,000:

    ls -1 $TMPDIR/tmp.*|wc -l                                                     :

I'm not suggesting to use GNU parallel in the test,
but if you have a few spare cores or systems, it does
make it easier to run large parallel tests in an attempt
to find something more reliable.

I noticed that decreasing the input size to 1000 lines
had little effect on the number of failures, but obviously
makes the test less expensive.

> Or if you have a better one in mind, please let me know.
>
> There are also some test cases I need to add for the
> (unrelated) sort-compression bug, which is next on my
> list of coreutils bugs to look at.

It would be great to fix that for 8.8, too,
but don't worry if you don't get to it soon.
I can always make an 8.9 not long afterward.
Perhaps very few people use --compress-program=PROG,
or this is due to a latent problem that is showing up only now.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Sun, 12 Dec 2010 21:37:02 GMT) Full text and rfc822 format available.

Message #32 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Sun, 12 Dec 2010 13:42:31 -0800
On 12/12/2010 07:41 AM, Jim Meyering wrote:
> That sounds good, assuming it triggers the bug reliably for you.
> I was hoping to find a way to reproduce it without relying on gensort,
> but won't object if you want to do that.

In my attempts to reproduce the problem, it's pretty flaky.
I think it depends on how busy the operating system is.
Sometimes I'd get failures all the time; sometimes, almost
never.  (This was with valgrind; I had much less luck without
valgrind.)

Anyway, I pushed this, which seemed to work well enough
on my host.  It prefers gensort if available, but falls
back on seq+shuf if not.

From 63d1b425976ccc0b89159d743e33eb5da634de3c Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Sun, 12 Dec 2010 13:38:19 -0800
Subject: [PATCH] tests: test for access to stale thread memory

* tests/misc/sort-stale-thread-mem: New tests.
* tests/Makefile.am (TESTS): Add it.
---
 tests/Makefile.am                |    1 +
 tests/misc/sort-stale-thread-mem |   44 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/sort-stale-thread-mem

diff --git a/tests/Makefile.am b/tests/Makefile.am
index b573061..f7a8af8 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -238,6 +238,7 @@ TESTS =						\
   misc/sort-month				\
   misc/sort-rand				\
   misc/sort-spinlock-abuse			\
+  misc/sort-stale-thread-mem			\
   misc/sort-unique				\
   misc/sort-unique-segv				\
   misc/sort-version				\
diff --git a/tests/misc/sort-stale-thread-mem b/tests/misc/sort-stale-thread-mem
new file mode 100755
index 0000000..c4f4fcb
--- /dev/null
+++ b/tests/misc/sort-stale-thread-mem
@@ -0,0 +1,44 @@
+#!/bin/sh
+# Trigger a bug that would cause 'sort' to reference stale thread stack memory.
+
+# Copyright (C) 2010 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# written by Jim Meyering and Paul Eggert
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+print_ver_ sort
+
+expensive_
+
+valgrind --help >/dev/null || skip_ "requires valgrind"
+test "$(nproc)" = 1 && skip_ "requires a multi-core system"
+
+# gensort output seems to trigger the failure more often,
+# so prefer gensort if it is available.
+(gensort -a 10000 in) 2>/dev/null ||
+  seq -f %-98f 10000 | shuf > in ||
+  framework_failure_
+
+# With the bug, 'sort' would fail under valgrind about half the time,
+# on some circa-2010 multicore Linux platforms.  Run the test 10 times
+# so that the probability of missing the bug should be about 1 in
+# 2**100 on these hosts.
+fail=0
+for i in $(seq 100); do
+  valgrind --quiet --error-exitcode=3 \
+      sort -S 100K --parallel=2 in > /dev/null ||
+    { fail=$?; echo iteration $i failed; Exit $fail; }
+done
-- 
1.7.2





Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Mon, 13 Dec 2010 07:25:02 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the sort
	-u segfault)
Date: Mon, 13 Dec 2010 08:30:44 +0100
Paul Eggert wrote:
> On 12/12/2010 07:41 AM, Jim Meyering wrote:
>> That sounds good, assuming it triggers the bug reliably for you.
>> I was hoping to find a way to reproduce it without relying on gensort,
>> but won't object if you want to do that.
>
> In my attempts to reproduce the problem, it's pretty flaky.
> I think it depends on how busy the operating system is.
> Sometimes I'd get failures all the time; sometimes, almost
> never.  (This was with valgrind; I had much less luck without
> valgrind.)
>
> Anyway, I pushed this, which seemed to work well enough
> on my host.  It prefers gensort if available, but falls
> back on seq+shuf if not.
>
> Subject: [PATCH] tests: test for access to stale thread memory
>
> * tests/misc/sort-stale-thread-mem: New tests.
> * tests/Makefile.am (TESTS): Add it.

Thank you!
I did this, too:


From 8351407f874ab3d6fc0830e78a6c234bf1583e3f Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> redhat.com>
Date: Mon, 13 Dec 2010 08:07:25 +0100
Subject: [PATCH 1/2] tests: mark new test as very expensive

* tests/misc/sort-stale-thread-mem: Don't initialize fail=0 here;
that is done in init.sh.  This avoids a syntax-check failure.
Invoke "Exit $fail" at end, too.
Mark as a very expensive test.
---
 tests/misc/sort-stale-thread-mem |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/misc/sort-stale-thread-mem b/tests/misc/sort-stale-thread-mem
index c4f4fcb..8ad60ed 100755
--- a/tests/misc/sort-stale-thread-mem
+++ b/tests/misc/sort-stale-thread-mem
@@ -21,7 +21,7 @@
 . "${srcdir=.}/init.sh"; path_prepend_ ../src
 print_ver_ sort

-expensive_
+very_expensive_

 valgrind --help >/dev/null || skip_ "requires valgrind"
 test "$(nproc)" = 1 && skip_ "requires a multi-core system"
@@ -36,9 +36,10 @@ test "$(nproc)" = 1 && skip_ "requires a multi-core system"
 # on some circa-2010 multicore Linux platforms.  Run the test 10 times
 # so that the probability of missing the bug should be about 1 in
 # 2**100 on these hosts.
-fail=0
 for i in $(seq 100); do
   valgrind --quiet --error-exitcode=3 \
       sort -S 100K --parallel=2 in > /dev/null ||
     { fail=$?; echo iteration $i failed; Exit $fail; }
 done
+
+Exit $fail
--
1.7.3.3.38.gc6d05


From 0c70708db7ed32d2b379116dc6bf64f07539aaf1 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering <at> redhat.com>
Date: Mon, 13 Dec 2010 08:19:12 +0100
Subject: [PATCH 2/2] tests: tweak basic-1 to use warn_ rather than literal "exit 77"

* tests/install/basic-1 (just_built_dd): Use warn_, rather than
cat and exit 77.
---
 tests/install/basic-1 |   24 ++++++------------------
 1 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/tests/install/basic-1 b/tests/install/basic-1
index 5e07bab..3c45c2a 100755
--- a/tests/install/basic-1
+++ b/tests/install/basic-1
@@ -39,28 +39,16 @@ dd2=dd2$EXEEXT

 just_built_dd=$abs_top_builddir/src/$dd

-test -r "$just_built_dd" || \
-  {
-    cat 1>&2 <<EOF
-$0: WARNING!!!
-Your just-built dd binary, $just_built_dd
-is not readable, so skipping the remaining tests in this file.
-EOF
-    exit 77
-  }
+test -r "$just_built_dd" \
+  || warn_ "WARNING!!! Your just-built dd binary, $just_built_dd
+is not readable, so skipping the remaining tests in this file."

 cp "$just_built_dd" . || fail=1
 cp $dd $dd2 || fail=1

-strip $dd2 || \
-  {
-    cat 1>&2 <<EOF
-$0: WARNING!!!
-Your strip command doesn't seem to work, so skipping
-the test of install's --strip option.
-EOF
-    exit 77
-  }
+strip $dd2 \
+  || warn_ "WARNING!!! Your strip command doesn't seem to work,
+so skipping the test of install's --strip option."

 # This test would fail with 3.16s when using versions of strip that
 # don't work on read-only files (the one from binutils works fine).
--
1.7.3.3.38.gc6d05




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Mon, 13 Dec 2010 17:59:01 GMT) Full text and rfc822 format available.

Message #38 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Mon, 13 Dec 2010 10:04:25 -0800
My recent patch had a typo in a comment, which I fixed as follows:

From 7e9599422e85be01dfceecf1f38ff2c2952a3f61 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Mon, 13 Dec 2010 10:02:06 -0800
Subject: [PATCH] tests: typo fix

* tests/misc/sort-stale-thread-mem: Fix typo in comment.
---
 tests/misc/sort-stale-thread-mem |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tests/misc/sort-stale-thread-mem b/tests/misc/sort-stale-thread-mem
index 8ad60ed..2955e22 100755
--- a/tests/misc/sort-stale-thread-mem
+++ b/tests/misc/sort-stale-thread-mem
@@ -33,7 +33,7 @@ test "$(nproc)" = 1 && skip_ "requires a multi-core system"
   framework_failure_
 
 # With the bug, 'sort' would fail under valgrind about half the time,
-# on some circa-2010 multicore Linux platforms.  Run the test 10 times
+# on some circa-2010 multicore Linux platforms.  Run the test 100 times
 # so that the probability of missing the bug should be about 1 in
 # 2**100 on these hosts.
 for i in $(seq 100); do
-- 
1.7.2





Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Thu, 16 Dec 2010 22:01:02 GMT) Full text and rfc822 format available.

Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, bug-coreutils <at> gnu.org,
	DJ Lucas <dj <at> linuxfromscratch.org>, coreutils <at> gnu.org
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort -u segfault)
Date: Thu, 16 Dec 2010 14:06:01 -0800
On 12/12/10 07:41, Jim Meyering wrote:
> Paul Eggert wrote:
>> There are also some test cases I need to add for the
>> (unrelated) sort-compression bug, which is next on my
>> list of coreutils bugs to look at.
> 
> It would be great to fix that for 8.8, too,
> but don't worry if you don't get to it soon.

I've fixed that now, with the patch enclosed at the end of this
message.  I was trying just to fix bug #2 mentioned in
<http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00078.html>.
But I discovered that the patch also fixes bug #3 listed there, as
that bug was also present in the old reference-count implementation of
waiting for subprocesses.  Bug #1 was fixed a day or so ago, so, as
far as I know, all the 'sort' bugs we've discussed in the last month
or so are fixed now.

The new test takes up to 28 minutes when it fails, so I marked it as
very expensive (I don't know what the boundary between "expensive"
and "very expensive" is, but I'm kind of an impatient guy....).


From 10b004660cec5470aa685c4ab327994d14eafeab Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Thu, 16 Dec 2010 13:55:13 -0800
Subject: [PATCH] sort: fix hang with sort --compress

* NEWS: Document this.
* src/sort.c (UNCOMPRESSED, UNREAPED, REAPED): New constants.
(struct tempnode): New member 'state', to hold these constants.
The pid member is now undefined if state == UNCOMPRESSED.
(struct sortfile): Replace member 'pid' with member 'temp'.
(uintptr): Remove.
(proctab_hasher, proctab_comparator, register_proc, delete_proc):
Proctab entries are now struct tempnode *, not pid_t, to handle
the case where multiple tempnode objects correspond to the same
pid.  This avoids a race condition that can cause a hang.
(register_proc): Arg is now struct tempnode *, not pid_t.  All
callers changed.
(delete_proc): Set tempnode state to REAPED.
(create_temp_file): No need to set pid member here; it's now
done when the pid is known.
(maybe_create_temp, create_temp): Remove PPID arg.  Return struct
tempnode *, not char *.  All callers changed.
(maybe_create_temp): Set node state to UNCOMPRESSED or UNREAPED.
No need to set node->pid to 0.
(open_temp): Replace NAME and PID args with a single TEMP arg.
All callers changed.  Wait only for unreaped children.
(zaptemp): Wait for decompressor to finish before removing its
temporary-file input.  This avoids .nfsXXXX hassles with NFS
and fixes a race (leading to a hang) regardless of NFS.
(open_input_files): Adjust to new way of dealing with temp files
and their subprocesses.
* tests/Makefile.am (TESTS): Add misc/sort-compress-hang.
* tests/misc/sort-compress-hang: New file.
---
 NEWS                          |    3 +-
 src/sort.c                    |  159 +++++++++++++++++++++--------------------
 tests/Makefile.am             |    1 +
 tests/misc/sort-compress-hang |   53 ++++++++++++++
 4 files changed, 136 insertions(+), 80 deletions(-)
 create mode 100755 tests/misc/sort-compress-hang

diff --git a/NEWS b/NEWS
index 429a1b7..a69ef54 100644
--- a/NEWS
+++ b/NEWS
@@ -21,7 +21,8 @@ GNU coreutils NEWS                                    -*- outline -*-
   sort with at least two threads no longer segfaults due to use of pointers
   into the stack of an expired thread. [bug introduced in coreutils-8.6]
 
-  sort --compress no longer mishandles subprocesses' exit statuses.
+  sort --compress no longer mishandles subprocesses' exit statuses,
+  and no longer hangs indefinitely due to a bug in waiting for subprocesses.
 
   sort -m -o f f ... f no longer dumps core when file descriptors are limited.
 
diff --git a/src/sort.c b/src/sort.c
index f53e64d..6bce49b 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -610,32 +610,33 @@ cs_leave (struct cs_status status)
     }
 }
 
+/* Possible states for a temp file.  If compressed, the file's status
+   is unreaped or reaped, depending on whether 'sort' has waited for
+   the subprocess to finish.  */
+enum { UNCOMPRESSED, UNREAPED, REAPED };
+
 /* The list of temporary files. */
 struct tempnode
 {
   struct tempnode *volatile next;
-  pid_t pid;     /* If compressed, the pid of compressor, else zero */
+  pid_t pid;     /* The subprocess PID; undefined if state == UNCOMPRESSED.  */
+  char state;
   char name[1];  /* Actual size is 1 + file name length.  */
 };
 static struct tempnode *volatile temphead;
 static struct tempnode *volatile *temptail = &temphead;
 
+/* A file to be sorted.  */
 struct sortfile
 {
+  /* The file's name.  */
   char const *name;
-  pid_t pid;     /* If compressed, the pid of compressor, else zero */
-};
 
-/* An integer that is the same size as a pointer.  To avoid GCC warnings,
-   cast from void * to this type rather than directly to pid_t.  */
-#ifdef UINTPTR_MAX
-typedef uintptr_t uintptr;
-#else
-typedef size_t uintptr;
-#endif
-verify (sizeof (pid_t) <= sizeof (uintptr));
+  /* Nonnull if this is a temporary file, in which case NAME == TEMP->name.  */
+  struct tempnode *temp;
+};
 
-/* IDs of unreaped compression and decompression subprocesses.  */
+/* Map PIDs of unreaped subprocesses to their struct tempnode objects.  */
 static Hash_table *proctab;
 
 enum { INIT_PROCTAB_SIZE = 47 };
@@ -643,16 +644,16 @@ enum { INIT_PROCTAB_SIZE = 47 };
 static size_t
 proctab_hasher (void const *entry, size_t tabsize)
 {
-  pid_t pid = (uintptr) entry;
-  return pid % tabsize;
+  struct tempnode const *node = entry;
+  return node->pid % tabsize;
 }
 
 static bool
 proctab_comparator (void const *e1, void const *e2)
 {
-  pid_t p1 = (uintptr) e1;
-  pid_t p2 = (uintptr) e2;
-  return p1 == p2;
+  struct tempnode const *n1 = e1;
+  struct tempnode const *n2 = e2;
+  return n1->pid == n2->pid;
 }
 
 /* The number of unreaped child processes.  */
@@ -690,11 +691,11 @@ reap (pid_t pid)
   return cpid;
 }
 
-/* Add PID to the process table.  Create the process table the first
-   time it's called.  */
+/* TEMP represents a new process; add it to the process table.  Create
+   the process table the first time it's called.  */
 
 static void
-register_proc (pid_t pid)
+register_proc (struct tempnode *temp)
 {
   if (! proctab)
     {
@@ -706,8 +707,9 @@ register_proc (pid_t pid)
         xalloc_die ();
     }
 
-  uintptr p = pid;
-  if (! hash_insert (proctab, (void *) p))
+  temp->state = UNREAPED;
+
+  if (! hash_insert (proctab, temp))
     xalloc_die ();
 }
 
@@ -717,8 +719,14 @@ register_proc (pid_t pid)
 static bool
 delete_proc (pid_t pid)
 {
-  uintptr p = pid;
-  return !! hash_delete (proctab, (void *) p);
+  struct tempnode test;
+
+  test.pid = pid;
+  struct tempnode *node = hash_delete (proctab, &test);
+  if (! node)
+    return false;
+  node->state = REAPED;
+  return true;
 }
 
 /* Remove PID from the process table, and wait for it to exit if it
@@ -802,7 +810,6 @@ create_temp_file (int *pfd, bool survive_fd_exhaustion)
   memcpy (file, temp_dir, len);
   memcpy (file + len, slashbase, sizeof slashbase);
   node->next = NULL;
-  node->pid = 0;
   if (++temp_dir_index == temp_dir_count)
     temp_dir_index = 0;
 
@@ -1010,23 +1017,21 @@ pipe_fork (int pipefds[2], size_t tries)
 #endif
 }
 
-/* Create a temporary file and start a compression program to filter output
-   to that file.  Set *PFP to the file handle and if PPID is non-NULL,
-   set *PPID to the PID of the newly-created process.  If the creation
+/* Create a temporary file and, if asked for, start a compressor
+   to that file.  Set *PFP to the file handle and return
+   the address of the new temp node.  If the creation
    fails, return NULL if the failure is due to file descriptor
    exhaustion and SURVIVE_FD_EXHAUSTION; otherwise, die.  */
 
-static char *
-maybe_create_temp (FILE **pfp, pid_t *ppid, bool survive_fd_exhaustion)
+static struct tempnode *
+maybe_create_temp (FILE **pfp, bool survive_fd_exhaustion)
 {
   int tempfd;
   struct tempnode *node = create_temp_file (&tempfd, survive_fd_exhaustion);
-  char *name;
-
   if (! node)
     return NULL;
 
-  name = node->name;
+  node->state = UNCOMPRESSED;
 
   if (compress_program)
     {
@@ -1039,7 +1044,7 @@ maybe_create_temp (FILE **pfp, pid_t *ppid, bool survive_fd_exhaustion)
           close (pipefds[0]);
           tempfd = pipefds[1];
 
-          register_proc (node->pid);
+          register_proc (node);
         }
       else if (node->pid == 0)
         {
@@ -1053,45 +1058,40 @@ maybe_create_temp (FILE **pfp, pid_t *ppid, bool survive_fd_exhaustion)
             error (SORT_FAILURE, errno, _("couldn't execute %s"),
                    compress_program);
         }
-      else
-        node->pid = 0;
     }
 
   *pfp = fdopen (tempfd, "w");
   if (! *pfp)
-    die (_("couldn't create temporary file"), name);
-
-  if (ppid)
-    *ppid = node->pid;
+    die (_("couldn't create temporary file"), node->name);
 
-  return name;
+  return node;
 }
 
-/* Create a temporary file and start a compression program to filter output
-   to that file.  Set *PFP to the file handle and if *PPID is non-NULL,
-   set it to the PID of the newly-created process.  Die on failure.  */
+/* Create a temporary file and, if asked for, start a compressor
+   to that file.  Set *PFP to the file handle and return the address
+   of the new temp node.  Die on failure.  */
 
-static char *
-create_temp (FILE **pfp, pid_t *ppid)
+static struct tempnode *
+create_temp (FILE **pfp)
 {
-  return maybe_create_temp (pfp, ppid, false);
+  return maybe_create_temp (pfp, false);
 }
 
 /* Open a compressed temp file and start a decompression process through
-   which to filter the input.  PID must be the valid processes ID of the
-   process used to compress the file.  Return NULL (setting errno to
+   which to filter the input.  Return NULL (setting errno to
    EMFILE) if we ran out of file descriptors, and die on any other
    kind of failure.  */
 
 static FILE *
-open_temp (char const *name, pid_t pid)
+open_temp (struct tempnode *temp)
 {
   int tempfd, pipefds[2];
   FILE *fp = NULL;
 
-  wait_proc (pid);
+  if (temp->state == UNREAPED)
+    wait_proc (temp->pid);
 
-  tempfd = open (name, O_RDONLY);
+  tempfd = open (temp->name, O_RDONLY);
   if (tempfd < 0)
     return NULL;
 
@@ -1119,7 +1119,8 @@ open_temp (char const *name, pid_t pid)
              compress_program);
 
     default:
-      register_proc (child);
+      temp->pid = child;
+      register_proc (temp);
       close (tempfd);
       close (pipefds[1]);
 
@@ -1161,6 +1162,9 @@ zaptemp (char const *name)
   for (pnode = &temphead; (node = *pnode)->name != name; pnode = &node->next)
     continue;
 
+  if (node->state == UNREAPED)
+    wait_proc (node->pid);
+
   /* Unlink the temporary file in a critical section to avoid races.  */
   next = node->next;
   cs = cs_enter ();
@@ -2773,8 +2777,8 @@ open_input_files (struct sortfile *files, size_t nfiles, FILE ***pfps)
   /* Open as many input files as we can.  */
   for (i = 0; i < nfiles; i++)
     {
-      fps[i] = (files[i].pid
-                ? open_temp (files[i].name, files[i].pid)
+      fps[i] = (files[i].temp && files[i].temp->state != UNCOMPRESSED
+                ? open_temp (files[i].temp)
                 : stream_open (files[i].name, "r"));
       if (!fps[i])
         break;
@@ -3573,8 +3577,7 @@ avoid_trashing_input (struct sortfile *files, size_t ntemps,
   size_t i;
   bool got_outstat = false;
   struct stat outstat;
-  char const *tempcopy = NULL;
-  pid_t pid IF_LINT (= 0);
+  struct tempnode *tempcopy = NULL;
 
   for (i = ntemps; i < nfiles; i++)
     {
@@ -3608,12 +3611,12 @@ avoid_trashing_input (struct sortfile *files, size_t ntemps,
           if (! tempcopy)
             {
               FILE *tftp;
-              tempcopy = create_temp (&tftp, &pid);
-              mergefiles (&files[i], 0, 1, tftp, tempcopy);
+              tempcopy = create_temp (&tftp);
+              mergefiles (&files[i], 0, 1, tftp, tempcopy->name);
             }
 
-          files[i].name = tempcopy;
-          files[i].pid = pid;
+          files[i].name = tempcopy->name;
+          files[i].temp = tempcopy;
         }
     }
 }
@@ -3648,13 +3651,12 @@ merge (struct sortfile *files, size_t ntemps, size_t nfiles,
       for (out = in = 0; nmerge <= nfiles - in; out++)
         {
           FILE *tfp;
-          pid_t pid;
-          char *temp = create_temp (&tfp, &pid);
+          struct tempnode *temp = create_temp (&tfp);
           size_t num_merged = mergefiles (&files[in], MIN (ntemps, nmerge),
-                                          nmerge, tfp, temp);
+                                          nmerge, tfp, temp->name);
           ntemps -= MIN (ntemps, num_merged);
-          files[out].name = temp;
-          files[out].pid = pid;
+          files[out].name = temp->name;
+          files[out].temp = temp;
           in += num_merged;
         }
 
@@ -3668,13 +3670,12 @@ merge (struct sortfile *files, size_t ntemps, size_t nfiles,
              files as possible, to avoid needless I/O.  */
           size_t nshortmerge = remainder - cheap_slots + 1;
           FILE *tfp;
-          pid_t pid;
-          char *temp = create_temp (&tfp, &pid);
+          struct tempnode *temp = create_temp (&tfp);
           size_t num_merged = mergefiles (&files[in], MIN (ntemps, nshortmerge),
-                                          nshortmerge, tfp, temp);
+                                          nshortmerge, tfp, temp->name);
           ntemps -= MIN (ntemps, num_merged);
-          files[out].name = temp;
-          files[out++].pid = pid;
+          files[out].name = temp->name;
+          files[out++].temp = temp;
           in += num_merged;
         }
 
@@ -3717,21 +3718,21 @@ merge (struct sortfile *files, size_t ntemps, size_t nfiles,
          (e.g., some other process could open a file between the time
          we closed and tried to create).  */
       FILE *tfp;
-      pid_t pid;
-      char *temp;
+      struct tempnode *temp;
       do
         {
           nopened--;
           xfclose (fps[nopened], files[nopened].name);
-          temp = maybe_create_temp (&tfp, &pid, ! (nopened <= 2));
+          temp = maybe_create_temp (&tfp, ! (nopened <= 2));
         }
       while (!temp);
 
       /* Merge into the newly allocated temporary.  */
-      mergefps (&files[0], MIN (ntemps, nopened), nopened, tfp, temp, fps);
+      mergefps (&files[0], MIN (ntemps, nopened), nopened, tfp, temp->name,
+                fps);
       ntemps -= MIN (ntemps, nopened);
-      files[0].name = temp;
-      files[0].pid = pid;
+      files[0].name = temp->name;
+      files[0].temp = temp;
 
       memmove (&files[1], &files[nopened], (nfiles - nopened) * sizeof *files);
       ntemps++;
@@ -3807,7 +3808,7 @@ sort (char *const *files, size_t nfiles, char const *output_file,
           else
             {
               ++ntemps;
-              temp_output = create_temp (&tfp, NULL);
+              temp_output = create_temp (&tfp)->name;
             }
           if (1 < buf.nlines)
             {
@@ -3849,7 +3850,7 @@ sort (char *const *files, size_t nfiles, char const *output_file,
       for (i = 0; node; i++)
         {
           tempfiles[i].name = node->name;
-          tempfiles[i].pid = node->pid;
+          tempfiles[i].temp = node;
           node = node->next;
         }
       merge (tempfiles, ntemps, ntemps, output_file);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index de06704..06d81f0 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -228,6 +228,7 @@ TESTS =						\
   misc/sort					\
   misc/sort-benchmark-random			\
   misc/sort-compress				\
+  misc/sort-compress-hang			\
   misc/sort-compress-proc			\
   misc/sort-continue				\
   misc/sort-debug-keys				\
diff --git a/tests/misc/sort-compress-hang b/tests/misc/sort-compress-hang
new file mode 100755
index 0000000..a536b1f
--- /dev/null
+++ b/tests/misc/sort-compress-hang
@@ -0,0 +1,53 @@
+#!/bin/sh
+# Test for sort --compress hang
+
+# Copyright (C) 2010 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+print_ver_ sort
+very_expensive_
+
+cat <<\EOF >compress || framework_failure
+#!/bin/sh
+tr 41 14 || exit
+touch ok
+EOF
+
+chmod +x compress
+
+seq -w 200000 > exp || fail=1
+tac exp > in || fail=1
+
+# When the bug occurs, 'sort' hangs forever.  When it doesn't occur,
+# 'sort' could be running slowly on an overburdened machine.
+# On a circa-2010 Linux server using NFS, a successful test completes
+# in about 170 seconds, so specify 1700 seconds as a safety margin.
+timeout 1700 sort --compress-program=./compress -S 1k in > out || fail=1
+
+compare exp out || fail=1
+test -f ok || fail=1
+rm -f compress ok
+
+# If $TMPDIR is relative, give subprocesses time to react when 'sort' exits.
+# Otherwise, under NFS, when 'sort' unlinks the temp files and they
+# are renamed to .nfsXXXX instead of being removed, the parent cleanup
+# of this directory will fail because the files are still open.
+case $TMPDIR in
+/*) ;;
+*) sleep 1;;
+esac
+
+Exit $fail
-- 
1.7.2






Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7597; Package coreutils. (Thu, 16 Dec 2010 22:32:02 GMT) Full text and rfc822 format available.

Message #44 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chen Guo <chen.guo.0625 <at> gmail.com>, coreutils <at> gnu.org,
	bug-coreutils <at> gnu.org, Jim Meyering <jim <at> meyering.net>,
	DJ Lucas <dj <at> linuxfromscratch.org>
Subject: Re: bug#7597: multi-threaded sort can segfault (unrelated to the
	sort	-u segfault)
Date: Thu, 16 Dec 2010 22:35:03 +0000
On 16/12/10 22:06, Paul Eggert wrote:
> On 12/12/10 07:41, Jim Meyering wrote:
>> Paul Eggert wrote:
>>> There are also some test cases I need to add for the
>>> (unrelated) sort-compression bug, which is next on my
>>> list of coreutils bugs to look at.
>>
>> It would be great to fix that for 8.8, too,
>> but don't worry if you don't get to it soon.
> 
> I've fixed that now, with the patch enclosed

From a quick inspection, it looks good!

All sort tests pass on 32 bit.

$ time (cd tests && make check TESTS=misc/sort-compress-hang RUN_VERY_EXPENSIVE_TESTS=yes VERBOSE=yes)
PASS: misc/sort-compress-hang

real	1m53.543s
user	0m18.475s
sys	1m5.758s

cheers,
Pádraig.




Added tag(s) fixed. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:42:03 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 7597 <at> debbugs.gnu.org and Jim Meyering <jim <at> meyering.net> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:42:04 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 09 Nov 2018 12:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 17 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.