GNU bug report logs - #32472
sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

Package: coreutils;

Reported by: Vaayda Yaasra <vaaydayaasra <at> gmail.com>

Date: Sat, 18 Aug 2018 16:05:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 32472 in the body.
You can then email your comments to 32472 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#32472; Package coreutils. (Sat, 18 Aug 2018 16:05:02 GMT)

Acknowledgement sent to Vaayda Yaasra <vaaydayaasra <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 18 Aug 2018 16:05:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Vaayda Yaasra <vaaydayaasra <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: sort doesn't sort and uniq loses data for many non-Latin scripts on
 UTF-8 locales
Date: Sat, 18 Aug 2018 15:53:44 +0000
I’ve found out that sort doesn’t sort strings for many non-Latin scripts at
all if the locale you’re using is one of en_US.UTF-8, fr_FR.UTF-8 or
fi_FI.UTF-8 (probably others, too, but these are the ones I have tested).
For locales “C” and ko_KR.UTF-8, things work as expected. Here’s a test
case:

Open xterm, launch sort and input some lines of Syriac, Ethiopic, Korean,
Japanese (Hiragana or Katakana, not Han) or Thai text repeating one of the
lines twice. Here’s an example in Syriac:

ܡܠܬܐ
ܒܝܬܐ
ܒܪܢܫܐ
ܡܠܬܐ

Sort produces the following:

ܡܠܬܐ
ܒܝܬܐ
ܡܠܬܐ
ܒܪܢܫܐ

Here the strings are ordered only by their length, not by their characters.
Even the two instances of the word ܡܠܬܐ end up on non-adjacent lines (1
and 3). The expected sort order, based on Unicode code points, would be:

ܒܝܬܐ
ܒܪܢܫܐ
ܡܠܬܐ
ܡܠܬܐ
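
As a minimal sketch of that expectation: Python's built-in `sorted` compares `str` values code point by code point, with no locale involved, so it should yield the order listed above. Python is used here purely for illustration and is not part of the original report.

```python
# Sort the four Syriac lines from the report by Unicode code point.
# Python compares str values code point by code point, independent of locale.
lines = ["ܡܠܬܐ", "ܒܝܬܐ", "ܒܪܢܫܐ", "ܡܠܬܐ"]

by_code_point = sorted(lines)

for line in by_code_point:
    # Print each line followed by its code points so the ordering is visible.
    print(line, " ".join(f"U+{ord(ch):04X}" for ch in line))
```

The two identical lines end up adjacent, which is what uniq needs in order to collapse them correctly.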

If you further pass sort’s output to uniq, it produces the following:

ܡܠܬܐ
ܒܪܢܫܐ

Here the word on line 2 ܒܝܬܐ is completely lost since, like sort, uniq
seems to consider all Syriac strings of equal length as the same.
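
The loss can be modelled with a small sketch. It assumes, following the observation above, a comparison that treats any two strings with the same number of code points as equal. The names `same_length_equal` and `uniq_adjacent` are hypothetical, and the comparison is only a stand-in for the reported symptom, not glibc's actual collation logic.

```python
from typing import Callable, Iterable, List


def same_length_equal(a: str, b: str) -> bool:
    # ASSUMPTION: models the reported symptom only (same code point count
    # means "equal"); this is not how glibc actually computes collation.
    return len(a) == len(b)


def uniq_adjacent(lines: Iterable[str], equal: Callable[[str, str], bool]) -> List[str]:
    # Like uniq: keep a line only if it differs from the last line kept.
    kept: List[str] = []
    for line in lines:
        if not kept or not equal(kept[-1], line):
            kept.append(line)
    return kept


# The output of sort as shown in the report, fed to the uniq model.
sorted_output = ["ܡܠܬܐ", "ܒܝܬܐ", "ܡܠܬܐ", "ܒܪܢܫܐ"]
print(uniq_adjacent(sorted_output, same_length_equal))
```

Under this assumed comparison the first three lines form one run and only the first survives, which matches the two-line result in the report.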

Although this issue depends on the locale, I think it is not a locale issue per
se, since perl seems to handle similar cases as expected. For instance, the
following command produces the expected result:

perl -CDS -e 'use locale; use utf8; @str = ("ܡܠܬܐ", "ܒܝܬܐ", "ܒܪܢܫܐ",
"ܡܠܬܐ"); foreach $i (sort @str) { print "$i\n"; }'

Curiously enough, codepoints in Plane 1 seem to count as two codepoints of
the basic plane, so that if you sort | uniq the following (six codepoints
of Syriac and three codepoints of Phoenician):

ܥܠܝܟܘܢ
𐤁𐤉𐤕

you get “ܥܠܝܟܘܢ” as the result, whereas “𐤁𐤉𐤕” is lost. This is presumably
because UTF-8 encodes each Plane 1 character in four bytes, twice the two
bytes of a Syriac letter (surrogate pairs belong to UTF-16, not UTF-8).
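
The counts behind this observation can be checked with a short sketch (Python, used purely for illustration). Each Syriac letter occupies two bytes in UTF-8 and each Plane 1 letter occupies four, so the two lines have the same encoded length even though their code point counts differ.

```python
# Six Syriac code points, as given in the report.
syriac = "ܥܠܝܟܘܢ"
# Three Phoenician code points (Plane 1), as given in the report.
phoenician = "𐤁𐤉𐤕"

for text in (syriac, phoenician):
    encoded = text.encode("utf-8")
    # len(text) counts code points; len(encoded) counts UTF-8 bytes.
    print(len(text), "code points,", len(encoded), "bytes in UTF-8")
```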

Also curiously, LTR scripts seem to conflate with each other and RTL
scripts among themselves but not across the directionality line, so that if
you sort | uniq the following (three codepoints each in Ethiopic, Hangul,
Syriac, Hiragana and Thai):

ዘመን
스물셋
ܐܢܐ
わたし
ฟ้า

you are left with:

ܐܢܐ
ዘመን

That’s one line of Syriac and one line of Ethiopic; everything else was
lost. This issue does not seem to affect most Indic scripts (Devanagari,
Bengali, Telugu etc.) or Arabic. For CJK, things work as expected for the
main Unicode block (4E00..9FFF) but not for Extension A (3400..4DBF, such
as 㗖 or 㡘 or 㰋). For Greek, monotonic accents work fine but all polytonic
letters are conflated (αὐλὸς and αὐλῆς conflate to αὐλῆς). For Hebrew,
letters and vowel marks work fine but cantillation marks are conflated.

I'm using coreutils 8.28 on Ubuntu 18.04. I first reported this bug on
Launchpad at
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/1774857 but since
nobody has reacted for a couple of months, I decided to post the report
here.

Information forwarded to bug-coreutils <at> gnu.org:
bug#32472; Package coreutils. (Sat, 18 Aug 2018 17:35:02 GMT)

Message #8 received at 32472 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vaayda Yaasra <vaaydayaasra <at> gmail.com>, 32472 <at> debbugs.gnu.org
Subject: Re: bug#32472: sort doesn't sort and uniq loses data for many
 non-Latin scripts on UTF-8 locales
Date: Sat, 18 Aug 2018 10:34:31 -0700
Vaayda Yaasra wrote:
> Here’s an example in Syriac:
> 
> ܡܠܬܐ
> ܒܝܬܐ
> ܒܪܢܫܐ
> ܡܠܬܐ
> 
> Sort produces the following:
> 
> ܡܠܬܐ
> ܒܝܬܐ
> ܡܠܬܐ
> ܒܪܢܫܐ

This is a property of your locale, so I suggest sending a bug report to whoever 
maintains your locale. You should be able to reproduce the problem by bypassing 
GNU 'sort' entirely and using the C strcoll function.

For what it's worth, I observe the problem on Ubuntu 18.04 but not on Fedora 28. 
As Fedora tends to be more up-to-date, perhaps the problem is fixed already in 
glibc.
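
One way to try the suggested check without GNU 'sort' is sketched below in Python, whose `locale.strcoll` delegates to the C library's collation routines. The helper name `collate_compare` is hypothetical, and the locale names are the ones mentioned in the report; whether they are installed depends on the system. On an affected system, a result of 0 for two different strings would presumably reproduce the problem outside of sort.

```python
import locale
from typing import Optional


def collate_compare(a: str, b: str, locale_name: str) -> Optional[int]:
    """Compare a and b using the C library's collation for locale_name.

    Returns a negative, zero, or positive int, or None if the locale
    is not available on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b)
    finally:
        # Restore the collation locale that was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


# Compare two different Syriac words from the report under two locales.
for name in ("C", "en_US.UTF-8"):
    print(name, collate_compare("ܡܠܬܐ", "ܒܝܬܐ", name))
```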




Information forwarded to bug-coreutils <at> gnu.org:
bug#32472; Package coreutils. (Tue, 30 Oct 2018 03:56:02 GMT)

Message #11 received at 32472 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Vaayda Yaasra <vaaydayaasra <at> gmail.com>, 32472 <at> debbugs.gnu.org
Subject: Re: bug#32472: sort doesn't sort and uniq loses data for many
 non-Latin scripts on UTF-8 locales
Date: Mon, 29 Oct 2018 21:54:59 -0600
tags 32472 notabug
close 32472
stop


On 2018-08-18 11:34 a.m., Paul Eggert wrote:
> Vaayda Yaasra wrote:
>> Here’s an example in Syriac:
>>
>> ܡܠܬܐ
>> ܒܝܬܐ
>> ܒܪܢܫܐ
>> ܡܠܬܐ
>>
>> Sort produces the following:
>>
>> ܡܠܬܐ
>> ܒܝܬܐ
>> ܡܠܬܐ
>> ܒܪܢܫܐ
> 
> This is a property of your locale, so I suggest sending a bug report to 
> whoever maintains your locale. You should be able to reproduce the 
> problem by bypassing GNU 'sort' entirely and using the C strcoll function.
> 
> For what it's worth, I observe the problem on Ubuntu 18.04 but not on 
> Fedora 28. As Fedora tends to be more up-to-date, perhaps the problem is 
> fixed already in glibc.

Given the above, and with no further comments,
I'm closing this bug.

-assaf




Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:56:02 GMT)

bug closed, send any further explanations to 32472 <at> debbugs.gnu.org and Vaayda Yaasra <vaaydayaasra <at> gmail.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:56:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Nov 2018 12:24:11 GMT)

This bug report was last modified 5 years and 123 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.