GNU bug report logs - #60544
sort hangs on lengthy line with invalid UTF8 characters

Package: coreutils;

Reported by: "DE CARNE DE CARNAVALET, Xavier [COMP]" <xavier.decarnedecarnavalet <at> polyu.edu.hk>

Date: Wed, 4 Jan 2023 07:35:02 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 60544 in the body.
You can then email your comments to 60544 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#60544; Package coreutils. (Wed, 04 Jan 2023 07:35:02 GMT)

Acknowledgement sent to "DE CARNE DE CARNAVALET, Xavier [COMP]" <xavier.decarnedecarnavalet <at> polyu.edu.hk>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 04 Jan 2023 07:35:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "DE CARNE DE CARNAVALET, Xavier [COMP]"
 <xavier.decarnedecarnavalet <at> polyu.edu.hk>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: sort hangs on lengthy line with invalid UTF8 characters
Date: Wed, 4 Jan 2023 04:38:33 +0000
[Message part 1 (text/plain, inline)]
sort seems to do extra computation on long lines with invalid UTF-8 characters, and could hang for days on just two lines.

Here is the minimal example I could make to reproduce the bug:
$ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
$ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2

To verify:
$ ls -l file*
-rw-rw-r-- 1 u u    4 Jan  4 12:13 file1
-rw-rw-r-- 1 u u 3001 Jan  4 12:13 file2
$ xxd -p file1
cde5e00a
$ xxd -p file2
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
[...]
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
0a
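
To double-check that these bytes really are invalid UTF-8 (0xCD is a two-byte lead byte, but 0xE5 is not a continuation byte), one option is to round-trip the file through iconv. This is a minimal sketch, assuming an iconv that rejects malformed input as glibc's does; the helper name is_valid_utf8 is made up for illustration:

```shell
# Hypothetical helper: exit status 0 only if FILE decodes cleanly as UTF-8.
# Assumes iconv fails on malformed input (glibc's iconv does).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

For example, `is_valid_utf8 file1 || echo "file1: invalid UTF-8"` should print the message for the files generated above.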

Then:
$ export LC_ALL=en_US.UTF8
$ time sort --debug file1 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m1.951s
user    0m1.951s
sys     0m0.000s

It took nearly two seconds to sort two lines from two files.
If I replace the \xe0 with \x61 in the first (small) file, the time gets down to milliseconds:
$ perl -e 'print "\xcd\xe5\x61"; print "\n"' > file3
$ time sort --debug file3 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m0.007s
user    0m0.003s
sys     0m0.003s

The time it takes increases as one of the files gets larger; for instance, with 2k repetitions:
$ perl -e 'print "\xcd\xe5\xe0"x2000; print "\n"' > file4
$ time sort --debug file1 file4
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m7.696s
user    0m7.690s
sys     0m0.004s
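
The measurements above could be repeated for other line lengths with a small loop. This is only a sketch: the helper names make_repeated and time_sort are invented here, and it assumes bash (for the `time` keyword) and an installed en_US.UTF8 locale. It uses printf with octal escapes rather than perl, writing the same bytes CD E5 E0.

```shell
# Hypothetical helper: write COUNT copies of the bytes CD E5 E0, then a newline.
make_repeated() {   # usage: make_repeated COUNT OUTFILE
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '\315\345\340'
        i=$((i + 1))
    done > "$2"
    printf '\n' >> "$2"
}

# Hypothetical helper: time sorting a 1-repetition line against a
# COUNT-repetition line. Assumes bash and an installed en_US.UTF8 locale.
time_sort() {       # usage: time_sort COUNT
    dir=$(mktemp -d)
    make_repeated 1 "$dir/short"
    make_repeated "$1" "$dir/long"
    ( export LC_ALL=en_US.UTF8; time sort "$dir/short" "$dir/long" >/dev/null )
    rm -rf "$dir"
}
```

Running `for n in 500 1000 2000; do time_sort "$n"; done` would show how the time grows. The two figures reported here (1.951s at 1000 repetitions, 7.696s at 2000) are roughly a fourfold increase per doubling, which would be consistent with quadratic growth, although two data points do not establish that.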

I would expect sort to take milliseconds at most in all cases for two moderately long lines.

$ uname -a
Linux 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ apt list installed coreutils
coreutils/focal,now 8.30-3ubuntu2 amd64 [installed]
$ sort --version
sort (GNU coreutils) 8.30

Xavier de Carné de Carnavalet

[file1 (application/octet-stream, attachment)]
[file2 (application/octet-stream, attachment)]
[file3 (application/octet-stream, attachment)]
[file4 (application/octet-stream, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#60544; Package coreutils. (Sun, 08 Jan 2023 22:04:02 GMT)

Message #8 received at 60544 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: "DE CARNE DE CARNAVALET, Xavier [COMP]"
 <xavier.decarnedecarnavalet <at> polyu.edu.hk>, 60544 <at> debbugs.gnu.org
Subject: Re: bug#60544: sort hangs on lengthy line with invalid UTF8 characters
Date: Sun, 8 Jan 2023 22:03:28 +0000
tag 60544 notabug
close 60544
stop

On 04/01/2023 04:38, DE CARNE DE CARNAVALET, Xavier [COMP] wrote:
> sort seems to do extra computation on long lines with invalid UTF-8 characters, and could hang for days on just two lines.
> 
> Here is the minimal example I could make to reproduce the bug:
> $ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
> $ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2

> Then:
> $ export LC_ALL=en_US.UTF8
> $ time sort --debug file1 file2
> sort: using 'en_US.UTF8' sorting rules
> [...]
> real    0m1.951s
> user    0m1.951s
> sys     0m0.000s
> 
> It took nearly two seconds to sort two lines from two files.
> If I replace the \xe0 with \x61 in the first (small) file, the time gets down to milliseconds:

If I profile sort like:

  $ src/sort file1 file2 >/dev/null & perf top -p $!

It shows that all the time is spent in libc's __strcoll_l.
I see one strcoll performance bug which might be related:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441

I'd follow up with glibc, also specifying your glibc version.
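
Since the profile points at locale collation, one way to confirm the dependency (and a possible stopgap where locale-aware ordering is not required) is to sort in the C locale, where sort compares raw bytes instead of calling strcoll. This is a sketch only; the wrapper name sort_bytewise is made up, and the resulting order is byte order, not en_US order:

```shell
# Hypothetical wrapper: sort by raw byte values, bypassing locale collation.
# Note that the output order differs from a UTF-8 locale's collation order.
sort_bytewise() {
    LC_ALL=C sort "$@"
}
```

For example, `time sort_bytewise file1 file2 >/dev/null` (in bash) can be compared against the timings reported above.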

Marking this as not a coreutils bug for now.

cheers,
Pádraig

Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 08 Jan 2023 22:04:02 GMT)

bug closed, send any further explanations to 60544 <at> debbugs.gnu.org and "DE CARNE DE CARNAVALET, Xavier [COMP]" <xavier.decarnedecarnavalet <at> polyu.edu.hk> Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 08 Jan 2023 22:04:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 06 Feb 2023 12:24:09 GMT)

This bug report was last modified 1 year and 78 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.