GNU bug report logs -
#49340
small sort takes hours for UTF-8 locale
Previous Next
To reply to this bug, email your comments to 49340 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-coreutils <at> gnu.org
:
bug#49340
; Package
coreutils
.
(Fri, 02 Jul 2021 20:52:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Jon Klaas <blagothakus <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Fri, 02 Jul 2021 20:52:01 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hello,
I encountered a file that was taking hours to sort that was expected
to take negligible time. This seems to be due to the locale
LANG=en_US.UTF-8. I've worked around the problem by using LC_ALL=C, but
thought I would report this, as I didn't see a relevant bug report.
This was seen on centos 8 using package
coreutils-8.30-6.el8.x86_64
and the current
coreutils-8.30-8.el8.x86_64
#takes under 1 second.
export LC_ALL=C
sort tst00776.out
#slow sort takes many hours
export LC_ALL=en_US.UTF-8
sort tst00776.out
Looks like most of the time is consumed here:
#0 0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
#1 0x00005600d195d365 in strcoll_loop ()
#2 0x00005600d195bebd in xmemcoll0 ()
#3 0x00005600d1951176 in compare ()
#4 0x00005600d1951224 in sequential_sort ()
#5 0x00005600d19511d5 in sequential_sort ()
#6 0x00005600d195374b in sortlines ()
#7 0x00005600d194d96b in main ()
It's possible the input (attached) has invalid UTF-8.
I also tried on an older RHEL 7 and did NOT reproduce the problem with
coreutils.x86_64 8.22-23.el7
Thanks,
Jon Klaas
[Message part 2 (text/html, inline)]
[tst00776.out.gz (application/x-gzip, attachment)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#49340
; Package
coreutils
.
(Fri, 02 Jul 2021 23:20:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 49340 <at> debbugs.gnu.org (full text, mbox):
On 02/07/2021 20:32, Jon Klaas wrote:
> Hello,
>
> I encountered a file that was taking hours to sort that was expected
> to take negligible time. This seems to be due to the locale
> LANG=en_US.UTF-8. I've worked around the problem by using LC_ALL=C, but
> thought I would report this, as I didn't see a relevant bug report.
>
> This was seen on centos 8 using package
> coreutils-8.30-6.el8.x86_64
> and the current
> coreutils-8.30-8.el8.x86_64
>
>
> #takes under 1 second.
> export LC_ALL=C
> sort tst00776.out
>
> #slow sort takes many hours
> export LC_ALL=en_US.UTF-8
> sort tst00776.out
>
> Looks like most of the time is consumed here:
>
> #0 0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
> #1 0x00005600d195d365 in strcoll_loop ()
> #2 0x00005600d195bebd in xmemcoll0 ()
> #3 0x00005600d1951176 in compare ()
> #4 0x00005600d1951224 in sequential_sort ()
> #5 0x00005600d19511d5 in sequential_sort ()
> #6 0x00005600d195374b in sortlines ()
> #7 0x00005600d194d96b in main ()
>
> It's possible the input (attached) has invalid UTF-8.
>
> I also tried on an older RHEL 7 and did NOT reproduce the problem with
> coreutils.x86_64 8.22-23.el7
There are 7 lines in that input that are 65500 characters long,
which is triggering the slow behavior.
You can see the length distribution like:
awk '{print length}' < tst00776.out | uniq -c | less
There are no NUL bytes:
$ grep -Pa '\x00' tst00776.out | wc -l
0
Also it's just ASCII data:
$ iconv -fUTF8 -tASCII < tst00776.out | wc -l
11743
Since your data is ASCII, using LC_ALL=C is most appropriate to avoid strcoll():
$ LC_ALL=C sort < tst00776.out | wc -l
11743
You could also limit the length of lines compared with:
$ sort -k1,1.80 -s < tst00776.out | wc -l
11743
The vast majority of the time is spent in strcoll()
so this is a glibc issue rather than coreutils.
I think this is tracked in glibc at:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441
Now saying that, we might be able to improve things.
For example, using strxfrm() + strcmp() to minimize processing.
cheers,
Pádraig
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#49340
; Package
coreutils
.
(Sat, 03 Jul 2021 00:27:01 GMT)
Full text and
rfc822 format available.
Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):
On 7/2/21 4:19 PM, Pádraig Brady wrote:
> we might be able to improve things.
> For example, using strxfrm() + strcmp() to minimize processing.
I tried that long ago, and it was waaayyy slower than strcoll in the
typical case. glibc strxfrm is not at all optimized.
Which is fine, since strxfrm is a dumb API: its only point is
performance but its portable API is inherently low-performance for
typical uses. I've never seen it useful.
In short, this is a glibc strcoll bug and should be fixed there.
This bug report was last modified 3 years and 148 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.