GNU bug report logs - #49340
small sort takes hours for UTF-8 locale

Reported by: Jon Klaas <blagothakus <at> gmail.com>

Date: Fri, 2 Jul 2021 20:52:01 UTC

Severity: normal

To reply to this bug, email your comments to 49340 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-coreutils <at> gnu.org:
bug#49340; Package coreutils. (Fri, 02 Jul 2021 20:52:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jon Klaas <blagothakus <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 02 Jul 2021 20:52:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jon Klaas <blagothakus <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: small sort takes hours for UTF-8 locale
Date: Fri, 2 Jul 2021 14:32:21 -0500

[Message part 1 (text/plain, inline)]

Hello,

I encountered a file that was taking hours to sort that was expected
to take negligible time.  This seems to be due to the locale
LANG=en_US.UTF-8.  I've worked around the problem by using LC_ALL=C, but
thought I would report this, as I didn't see a relevant bug report.

This was seen on centos 8 using package
coreutils-8.30-6.el8.x86_64
and the current
coreutils-8.30-8.el8.x86_64


#takes under 1 second.
export LC_ALL=C
sort tst00776.out

#slow sort takes many hours
export LC_ALL=en_US.UTF-8
sort tst00776.out

Looks like most of the time is consumed here:

#0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
#1  0x00005600d195d365 in strcoll_loop ()
#2  0x00005600d195bebd in xmemcoll0 ()
#3  0x00005600d1951176 in compare ()
#4  0x00005600d1951224 in sequential_sort ()
#5  0x00005600d19511d5 in sequential_sort ()
#6  0x00005600d195374b in sortlines ()
#7  0x00005600d194d96b in main ()

It's possible the input (attached) has invalid UTF-8.

I also tried on an older RHEL 7 and did NOT reproduce the problem with
coreutils.x86_64                    8.22-23.el7

Thanks,

Jon Klaas

[Message part 2 (text/html, inline)]

[tst00776.out.gz (application/x-gzip, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#49340; Package coreutils. (Fri, 02 Jul 2021 23:20:02 GMT) Full text and rfc822 format available.

Message #8 received at 49340 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Jon Klaas <blagothakus <at> gmail.com>, 49340 <at> debbugs.gnu.org
Subject: Re: bug#49340: small sort takes hours for UTF-8 locale
Date: Sat, 3 Jul 2021 00:19:07 +0100

On 02/07/2021 20:32, Jon Klaas wrote:
> Hello,
> 
> I encountered a file that was taking hours to sort that was expected
> to take negligible time.  This seems to be due to the locale
> LANG=en_US.UTF-8.  I've worked around the problem by using LC_ALL=C, but
> thought I would report this, as I didn't see a relevant bug report.
> 
> This was seen on centos 8 using package
> coreutils-8.30-6.el8.x86_64
> and the current
> coreutils-8.30-8.el8.x86_64
> 
> 
> #takes under 1 second.
> export LC_ALL=C
> sort tst00776.out
> 
> #slow sort takes many hours
> export LC_ALL=en_US.UTF-8
> sort tst00776.out
> 
> Looks like most of the time is consumed here:
> 
> #0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
> #1  0x00005600d195d365 in strcoll_loop ()
> #2  0x00005600d195bebd in xmemcoll0 ()
> #3  0x00005600d1951176 in compare ()
> #4  0x00005600d1951224 in sequential_sort ()
> #5  0x00005600d19511d5 in sequential_sort ()
> #6  0x00005600d195374b in sortlines ()
> #7  0x00005600d194d96b in main ()
> 
> It's possible the input (attached) has invalid UTF-8.
> 
> I also tried on an older RHEL 7 and did NOT reproduce the problem with
> coreutils.x86_64                    8.22-23.el7

There are 7 lines in that input that are 65500 characters long,
which is triggering the slow behavior.
You can see the length distribution like:
  awk '{print length}' < tst00776.out | uniq -c | less

There are no NUL bytes:
  $ grep -Pa '\x00' tst00776.out | wc -l
  0

Also it's just ASCII data:
  $ iconv -fUTF8 -tASCII < tst00776.out | wc -l
  11743

Since your data is ASCII, using LC_ALL=C is most appropriate to avoid strcoll():
  $ LC_ALL=C sort < tst00776.out | wc -l
  11743

You could also limit the length of lines compared with:
  $ sort -k1,1.80 -s < tst00776.out | wc -l
  11743

The vast majority of the time is spent in strcoll()
so this is a glibc issue rather than coreutils.
I think this is tracked in glibc at:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441

Now saying that, we might be able to improve things.
For example, using strxfrm() + strcmp() to minimize processing.

cheers,
Pádraig

Information forwarded to bug-coreutils <at> gnu.org:
bug#49340; Package coreutils. (Sat, 03 Jul 2021 00:27:01 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#49340: small sort takes hours for UTF-8 locale
Date: Fri, 2 Jul 2021 17:25:51 -0700

On 7/2/21 4:19 PM, Pádraig Brady wrote:
> we might be able to improve things.
> For example, using strxfrm() + strcmp() to minimize processing.

I tried that long ago, and it was waaayyy slower than strcoll in the 
typical case. glibc strxfrm is not at all optimized.

Which is fine, since strxfrm is a dumb API: its only point is 
performance but its portable API is inherently low-performance for 
typical uses. I've never seen it useful.

In short, this is a glibc strcoll bug and should be fixed there.

This bug report was last modified 2 years and 306 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #49340 small sort takes hours for UTF-8 locale

GNU bug report logs - #49340
small sort takes hours for UTF-8 locale