GNU bug report logs - #60544
sort hangs on lengthy line with invalid UTF8 characters
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 60544 in the body.
You can then email your comments to 60544 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#60544; Package coreutils. (Wed, 04 Jan 2023 07:35:02 GMT)
Acknowledgement sent to "DE CARNE DE CARNAVALET, Xavier [COMP]" <xavier.decarnedecarnavalet <at> polyu.edu.hk>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 04 Jan 2023 07:35:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
sort seems to do extra computation on long lines with invalid UTF-8 characters and can hang for days on just two lines.
Here is the minimal example I could make to reproduce the bug:
$ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
$ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2
To verify:
$ ls -l file*
-rw-rw-r-- 1 u u 4 Jan 4 12:13 file1
-rw-rw-r-- 1 u u 3001 Jan 4 12:13 file2
$ xxd -p file1
cde5e00a
$ xxd -p file2
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
[...]
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
0a
Then:
$ export LC_ALL=en_US.UTF8
$ time sort --debug file1 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real 0m1.951s
user 0m1.951s
sys 0m0.000s
It took nearly two seconds to sort two lines from two files.
If I replace the \xe0 with \x61 in the first (small) file, the time gets down to milliseconds:
$ perl -e 'print "\xcd\xe5\x61"; print "\n"' > file3
$ time sort --debug file3 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real 0m0.007s
user 0m0.003s
sys 0m0.003s
The time increases as one of the files gets larger. For instance, with 2,000 repetitions:
$ perl -e 'print "\xcd\xe5\xe0"x2000; print "\n"' > file4
$ time sort --debug file1 file4
sort: using 'en_US.UTF8' sorting rules
[...]
real 0m7.696s
user 0m7.690s
sys 0m0.004s
I would expect sort to take milliseconds at most in all cases for two moderately long lines.
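
The growth with line length can be measured more systematically. The following is a minimal sketch and not part of the original report: `make_line` and `scan_sizes` are hypothetical helper names, and the timing assumes GNU date (for `%N`) and an installed en_US.UTF8 locale.

```shell
#!/bin/sh
# Print the 3-byte sequence \xcd\xe5\xe0 repeated $1 times, then a newline.
make_line() {
    count=$1
    k=0
    while [ "$k" -lt "$count" ]; do
        printf '\315\345\340'   # octal escapes for \xcd \xe5 \xe0
        k=$((k + 1))
    done
    printf '\n'
}

# Time `sort short long` for each repetition count given as an argument.
# Assumption: GNU date (%N gives nanoseconds) and the en_US.UTF8 locale.
scan_sizes() {
    dir=$(mktemp -d) || return 1
    make_line 1 > "$dir/short"
    for reps in "$@"; do
        make_line "$reps" > "$dir/long"
        start=$(date +%s%N)
        LC_ALL=en_US.UTF8 sort "$dir/short" "$dir/long" > /dev/null
        end=$(date +%s%N)
        echo "$reps repetitions: $(( (end - start) / 1000000 )) ms"
    done
    rm -rf "$dir"
}
```

Calling `scan_sizes 250 500 1000 2000` prints one timing per size. In the figures reported here, going from 1000 to 2000 repetitions raised the run time from 1.951s to 7.696s, roughly a factor of four for a doubling of the length, which would be consistent with quadratic growth.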
$ uname -a
Linux 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ apt list installed coreutils
coreutils/focal,now 8.30-3ubuntu2 amd64 [installed]
$ sort --version
sort (GNU coreutils) 8.30
Xavier de Carné de Carnavalet
[file1 (application/octet-stream, attachment)]
[file2 (application/octet-stream, attachment)]
[file3 (application/octet-stream, attachment)]
[file4 (application/octet-stream, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#60544; Package coreutils. (Sun, 08 Jan 2023 22:04:02 GMT)
Message #8 received at 60544 <at> debbugs.gnu.org:
tag 60544 notabug
close 60544
stop
On 04/01/2023 04:38, DE CARNE DE CARNAVALET, Xavier [COMP] wrote:
> sort seems to do extra computation on long lines with invalid UTF-8 characters and can hang for days on just two lines.
>
> Here is the minimal example I could make to reproduce the bug:
> $ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
> $ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2
> Then:
> $ export LC_ALL=en_US.UTF8
> $ time sort --debug file1 file2
> sort: using 'en_US.UTF8' sorting rules
> [...]
> real 0m1.951s
> user 0m1.951s
> sys 0m0.000s
>
> It took nearly two seconds to sort two lines from two files.
> If I replace the \xe0 with \x61 in the first (small) file, the time gets down to milliseconds:
If I profile sort like this:
$ src/sort file1 file2 >/dev/null & perf top -p $!
it shows that all the time is spent in libc's __strcoll_l.
I see one strcoll performance bug which might be related:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441
I'd follow up with glibc, also specifying your glibc version.
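
One way to obtain the glibc version for such a follow-up is sketched below. This is an illustrative addition, not part of the original reply; `libc_version` is a hypothetical helper name, and the fallback message covers systems where the C library is not glibc.

```shell
#!/bin/sh
# Print the C library version as reported by getconf.
# On non-glibc systems the variable is unknown, so print a fallback note.
libc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown (not glibc?)"
}

libc_version
```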
Marking this as not a coreutils bug for now.
cheers,
Pádraig
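
Since the profile points at locale-aware collation, one possible workaround, when plain byte order is acceptable, is to run sort in the C locale, where lines are compared byte by byte. A minimal sketch follows; it is not part of the original reply, and `sort_bytes` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sort the given files by raw byte values, bypassing locale collation.
# sort_bytes is a hypothetical wrapper name used only for illustration.
sort_bytes() {
    LC_ALL=C sort "$@"
}
```

For example, `sort_bytes file1 file2` orders the two lines by byte value. The resulting order can differ from en_US.UTF8 collation for valid text, so this only applies where byte order is good enough.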
Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 08 Jan 2023 22:04:02 GMT)
bug closed, send any further explanations to 60544 <at> debbugs.gnu.org and "DE CARNE DE CARNAVALET, Xavier [COMP]" <xavier.decarnedecarnavalet <at> polyu.edu.hk>
Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 08 Jan 2023 22:04:02 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 06 Feb 2023 12:24:09 GMT)
This bug report was last modified 1 year and 78 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.