GNU bug report logs - #46169
Parallelize merge sort

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Reported by: Ole Tange <tange@HIDDEN>; dated Fri, 29 Jan 2021 09:08:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 29 Jan 2021 09:07:32 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Jan 29 04:07:32 2021
Received: from localhost ([127.0.0.1]:51147 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1l5PkZ-0007Pp-Ve
	for submit <at> debbugs.gnu.org; Fri, 29 Jan 2021 04:07:32 -0500
Received: from lists.gnu.org ([209.51.188.17]:42248)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ole.tange@HIDDEN>) id 1l5PkW-0007Pd-8g
 for submit <at> debbugs.gnu.org; Fri, 29 Jan 2021 04:07:30 -0500
Received: from eggs.gnu.org ([2001:470:142:3::10]:41866)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ole.tange@HIDDEN>)
 id 1l5PkV-0007Cs-KS
 for bug-coreutils@HIDDEN; Fri, 29 Jan 2021 04:07:27 -0500
Received: from mail-ot1-f43.google.com ([209.85.210.43]:45769)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <ole.tange@HIDDEN>)
 id 1l5PkT-0000CU-Ul
 for bug-coreutils@HIDDEN; Fri, 29 Jan 2021 04:07:27 -0500
Received: by mail-ot1-f43.google.com with SMTP id n42so7962828ota.12
 for <bug-coreutils@HIDDEN>; Fri, 29 Jan 2021 01:07:25 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=C3pWjYX3HBGjH008hGvDvXBF7PpbTOkeNOImYlYRKTY=;
 b=bituOmv/Xto1N6IdxSazknTi3o88oEl0Ec/mdO9lQISrXhpa21K/h5GZRbP+8m7ni5
 jVCjYX/2bV5MYQmovqanHMc+xy7Gpo9XfaE5/SLzIB+HcrU91tDD6TSucFPjRjPmZDwL
 b3WvW2iEaCgqVNpeE1o7+zYnhOIAEZOSIK0NmV+Xen/bs4OmJ9s23g/W1vwiY1MOCfPA
 pJakVO/Vudn2ZBRyMlgVODMclexXMCtlHMhB295ZBDFeHvFZRX8tjH6JCeEZoAjMFefQ
 h6KTVp6qCmyJlnvzA2NCXQl7hzkyJMkSsiacbEmaza7sUriKVxH8agpSnREzEWQ6Bm4b
 SvlA==
X-Gm-Message-State: AOAM531p5B8/wOW9/gxBpgcvdiivK5Rb7tVIX3Qv4Z2CRMg+ictMoUmK
 7ESq8smfOdvZfK9q+wiycICKPcTJ/9x/Ud89HFKWeTN0wCU=
X-Google-Smtp-Source: ABdhPJwXzci+BZ8iZefoVlOgL1VGki6+A7syhulbq3zHSYKV5sm5Ss/CX1HcV2IRUT1+ZHSXGK+s7v8bGy6qYSdwZy8=
X-Received: by 2002:a05:6830:1ac3:: with SMTP id
 r3mr2552641otc.363.1611911244129; 
 Fri, 29 Jan 2021 01:07:24 -0800 (PST)
MIME-Version: 1.0
From: Ole Tange <tange@HIDDEN>
Date: Fri, 29 Jan 2021 10:07:12 +0100
Message-ID: <CA+4vN7w1kiqihiTBB0LwKuC4Kav+ENUjaf3fDhr-7dpuQT5XmA@HIDDEN>
Subject: Parallelize merge sort
To: bug-coreutils@HIDDEN
Content-Type: text/plain; charset="UTF-8"
Received-SPF: pass client-ip=209.85.210.43; envelope-from=ole.tange@HIDDEN;
 helo=mail-ot1-f43.google.com
X-Spam_score_int: 10
X-Spam_score: 1.0
X-Spam_bar: +
X-Spam_report: (1.0 / 5.0 requ) BAYES_00=-1.9, FREEMAIL_FORGED_FROMDOMAIN=0.249,
 FREEMAIL_FROM=0.001, HEADER_FROM_DIFFERENT_DOMAINS=0.25,
 RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001,
 SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
 TO_NO_BRKTS_PCNT=2.403 autolearn=no autolearn_force=no
X-Spam_action: no action
X-Spam-Score: 1.6 (+)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has NOT identified this incoming email as spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview: During the past 6 months I have been using GNU sort on a lot
 of files in the GByte range on my 64 core machine. I have discovered, that
 GNU sort is bad at using all the cores. I have built parsort (now distributed
 as part of GNU parallel) as a small wrapper around GNU sort and it performs
 2-3x better than GNU sort for files of this size on this machine. 
 Content analysis details:   (1.6 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -2.3 RCVD_IN_DNSWL_MED      RBL: Sender listed at https://www.dnswl.org/,
 medium trust [209.51.188.17 listed in list.dnswl.org]
 -0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 0.0 FREEMAIL_FROM          Sender email is commonly abused enduser mail
 provider (ole.tange[at]gmail.com)
 1.0 SPF_SOFTFAIL           SPF: sender does not match SPF record (softfail)
 0.2 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level
 mail domains are different
 0.0 RCVD_IN_MSPIKE_H4      RBL: Very Good reputation (+4)
 [209.51.188.17 listed in wl.mailspike.net]
 0.2 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and
 EnvelopeFrom freemail headers are different
 0.0 RCVD_IN_MSPIKE_WL      Mailspike good senders
 2.4 TO_NO_BRKTS_PCNT       To: lacks brackets + percentage
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.8 (-)

During the past 6 months I have been using GNU sort on a lot of files in
the GByte range on my 64 core machine. I have discovered, that GNU sort
is bad at using all the cores.

I have built parsort (now distributed as part of GNU parallel) as a
small wrapper around GNU sort and it performs 2-3x better than GNU sort
for files of this size on this machine.

The first step works great in GNU sort: Here you can really use all
your cores. But after the first step, it seems GNU sort does a merge
sort of all the results. And this pegs a single CPU at 100% while the
rest of the cores are pretty lazy.

It seems the merging is not parallelized.

parsort does that a little better than GNU sort. Instead of doing a
k-way merge at the final step, it does a 2-way merge on each core. It
still has to a 2-way merge of all data on a single core in the end, so
it is not even close to optimal.

Wikipedia describes a way of doing parallel merge:

https://en.wikipedia.org/wiki/Merge_sort#Parallel_multiway_merge_sort

I am no C-coder, and I respect that we want coreutils to be written in
C. Unfortunately that means I cannot really help doing the actual
coding (at least not in a way that would not introduce a plethora of
bugs).

But looking at the pseudocode
https://en.wikipedia.org/wiki/Merge_sort#Pseudocode it does not look
that hard to implement.

This method requires all data to fit in memory, but there are even
more advanced methods (such as
https://www.sciencedirect.com/science/article/pii/S0304397500002875)
that would not require this.

Could you consider implementing a parallel merge, so I can retire
parsort?

(Also: Thanks for maintaining a vital piece of software).

/Ole




Acknowledgement sent to Ole Tange <tange@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to bug-coreutils@HIDDEN:
bug#46169; Package coreutils. Full text available.
Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.
Last modified: Fri, 29 Jan 2021 09:15:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.