Package: coreutils;
To reply to this bug, email your comments to 40533 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
bug-coreutils <at> gnu.org
:bug#40533
; Package coreutils
.
(Fri, 10 Apr 2020 11:12:02 GMT) Full text and rfc822 format available.Ole Tange <tange <at> gnu.org>
:bug-coreutils <at> gnu.org
.
(Fri, 10 Apr 2020 11:12:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Ole Tange <tange <at> gnu.org> To: bug-coreutils <at> gnu.org Subject: GNU sort could be faster on machines with more cores Date: Fri, 10 Apr 2020 13:11:31 +0200
$ sort --version sort (GNU coreutils) 8.28 I am the author of GNU Parallel and my fetish is to try to run more stuff in parallel. I recently sorted a 2.4 GBytes/100 M lines file: export LC_ALL=C time sort --parallel=48 bigfile >/dev/null This takes 87 seconds on my 48 core machine. Everything is done in a RAM disk. By adjusting 48 I find the optimal is --parallel=46: 80 seconds. But it is possible to do better than that. A big part of the final part of the sorting is a single tread. It is likely merging the results of the other threads, but it leaves the other 47 cores idle. So I thought: Can we parallelize that more? The reasoning is: If you have N-1 cores sitting idle, you really do not care if you waste some CPU cycles, if you can get the result faster. So I tested. Let us start by looking at an easier case: We do not have 1 big file, but the bigfile is split into 48 files of similar sizes. In this case we can do: sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file1) <(sort file2) ) <(sort -m <(sort file3) <(sort file4) ) ) <(sort -m <(sort -m <(sort file5) <(sort file6) ) <(sort -m <(sort file7) <(sort file8) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) ) <(sort -m <(sort file11) <(sort file12) ) ) <(sort -m <(sort -m <(sort file13) <(sort file14) ) <(sort -m <(sort file15) <(sort file16) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) ) <(sort -m <(sort file19) <(sort file20) ) ) <(sort -m <(sort -m <(sort file21) <(sort file22) ) <(sort -m <(sort file23) <(sort file24) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file25) <(sort file26) ) <(sort -m <(sort file27) <(sort file28) ) ) <(sort -m <(sort -m <(sort file29) <(sort file30) ) <(sort -m <(sort file31) <(sort file32) ) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33) <(sort file34) ) <(sort -m <(sort file35) <(sort file36) ) ) <(sort -m <(sort -m <(sort file37) <(sort file38) ) <(sort -m <(sort file39) <(sort file40) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) ) <(sort -m <(sort file43) <(sort file44) ) ) <(sort -m <(sort -m <(sort file45) <(sort file46) ) <(sort -m <(sort file47) <(sort file48) ) ) ) ) ) Or indented: sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file1) <(sort file2) )\ <(sort -m <(sort file3) <(sort file4) ) )\ <(sort -m \ <(sort -m <(sort file5) <(sort file6) )\ <(sort -m <(sort file7) <(sort file8) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file9) <(sort file10) )\ <(sort -m <(sort file11) <(sort file12) ) )\ <(sort -m \ <(sort -m <(sort file13) <(sort file14) )\ <(sort -m <(sort file15) <(sort file16) ) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file17) <(sort file18) )\ <(sort -m <(sort file19) <(sort file20) ) )\ <(sort -m <(sort -m <(sort file21) <(sort file22) )\ <(sort -m <(sort file23) <(sort file24) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file25) <(sort file26) )\ <(sort -m <(sort file27) <(sort file28) ) )\ <(sort -m \ <(sort -m <(sort file29) <(sort file30) )\ <(sort -m <(sort file31) <(sort file32) ) ) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file33) <(sort file34) )\ <(sort -m <(sort file35) <(sort file36) ) )\ <(sort -m <(sort -m <(sort file37) <(sort file38) )\ <(sort -m <(sort file39) <(sort file40) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file41) <(sort file42) )\ <(sort -m <(sort file43) <(sort file44) ) )\ <(sort -m \ <(sort -m <(sort file45) <(sort file46) )\ <(sort -m <(sort file47) <(sort file48) ) ) ) ) ) So WTF is going on here? Here we do an initial sort of each of the 48 files. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. And so on, until we only have a single input. All of this is done in parallel when possible. This takes 60 seconds. But a lot of time is waiting for pipes. This can be lowered by added a buffer after each merge (mbuffer -m 30M): sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file1) <(sort file2) | mbuffer -m 30M ) <(sort -m <(sort file3) <(sort file4) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file5) <(sort file6) | mbuffer -m 30M ) <(sort -m <(sort file7) <(sort file8) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) | mbuffer -m 30M ) <(sort -m <(sort file11) <(sort file12) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file13) <(sort file14) | mbuffer -m 30M ) <(sort -m <(sort file15) <(sort file16) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) | mbuffer -m 30M ) <(sort -m <(sort file19) <(sort file20) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file21) <(sort file22) | mbuffer -m 30M ) <(sort -m <(sort file23) <(sort file24) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file25) <(sort file26) | mbuffer -m 30M ) <(sort -m <(sort file27) <(sort file28) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file29) <(sort file30) | mbuffer -m 30M ) <(sort -m <(sort file31) <(sort file32) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33) <(sort file34) | mbuffer -m 30M ) <(sort -m <(sort file35) <(sort file36) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file37) <(sort file38) | mbuffer -m 30M ) <(sort -m <(sort file39) <(sort file40) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) | mbuffer -m 30M ) <(sort -m <(sort file43) <(sort file44) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file45) <(sort file46) | mbuffer -m 30M ) <(sort -m <(sort file47) <(sort file48) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) This takes 24 seconds. This is including overhead of starting multiple sorts and mbuffers, and moving everything through pipes. If this was in a single process with a single address space, I imagine this would be even faster: Instead of moving data around, it should be possible to move only pointers around. It may also be possible to optimize the merge sort when there are only two inputs: i1 = read(input1) i2 = read(input2) while(input1 and input2) { while(i1 <= i2) { print i1 i1 = read(input1) } while(i2 <= i1) { print i2 i2 = read(input2) } } But I cheated a little: I started with 48 evenly sized files. I think this could easily be emulated by having a single reader thread spread blocks of full lines in a round robin style to the 48 threads that will do the initial sorting. Something like: while(read_block()) { find_last_newline(); write_block_to_process(n%48); n++; } If we are reading from a disk, then the reading may be slow, so if the 48 initial sorters can start sorting without having the full input, then we will save some wall clock time. /Ole
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.