GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings

Reported by: Roy Smith <roy <at> panix.com>

Date: Sun, 15 Dec 2019 19:41:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Report forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Sun, 15 Dec 2019 19:41:01 GMT)

Acknowledgement sent to Roy Smith <roy <at> panix.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 15 Dec 2019 19:41:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Roy Smith <roy <at> panix.com>
To: bug-coreutils <at> gnu.org
Subject: uniq -c gets wrong count with non-ascii strings
Date: Sun, 15 Dec 2019 14:40:14 -0500

With the following input:

> $ cat x
> "ⁿᵘˡˡ"
> "ܥܝܪܐܩ"


Running "uniq -c" says there's two copies of the same line!

> $ uniq -c x
>       2 "ⁿᵘˡˡ"


I've attached a copy of the test file, and here's the octal dump:

> $ od -b x
> 0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 245
> 0000020 334 235 334 252 334 220 334 251 042 012
> 0000032


I'm getting this on:

> Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
> uniq (GNU coreutils) 8.26

My macOS 10.13.6 box gets it right:

> $ uniq -c x
>    1 "ⁿᵘˡˡ"
>    1 "ܥܝܪܐܩ"


[x (application/octet-stream, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Mon, 16 Dec 2019 09:42:01 GMT)

Message #8 received at 38627 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Roy Smith <roy <at> panix.com>
Cc: Jim Meyering <jim <at> meyering.net>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Mon, 16 Dec 2019 01:41:13 -0800

On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
> 
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
> 
> 
> Running "uniq -c" says there's two copies of the same line!
> 
>> $ uniq -c x
>>       2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the
equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
lines compare equal in your locale, GNU 'uniq' says there's just one line.

The GNU 'uniq' behavior appears to be a consequence of this commit:

commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
Author: Jim Meyering <jim <at> meyering.net>
Date:   Fri Aug 2 14:42:37 2002 +0000

with a change noted this way in NEWS:

* uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.

However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
and I expect this means that the 2002 commit should be reverted so that GNU
'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).

I'll CC: this email to Jim Meyering to see whether he has an opinion about this.

In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
of plain 'uniq' in your shell script.
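
To make the mechanism concrete, here is a rough sketch in C of how a run-counting loop in the style of 'uniq -c' depends entirely on its comparison function. It is only an illustration, not code from uniq.c: the names count_runs, line_cmp and always_equal are hypothetical, and always_equal merely stands in for a strcoll() that reports the two lines as equal.

```c
#include <stddef.h>
#include <string.h>

/* Comparison callback: returns 0 when two lines count as duplicates.
   Both strcmp() and strcoll() have this signature.  */
typedef int (*line_cmp) (const char *, const char *);

/* Hypothetical stand-in for a collation function that treats every
   pair of lines as equal, as strcoll() did for the reported input.  */
int
always_equal (const char *a, const char *b)
{
  (void) a;
  (void) b;
  return 0;
}

/* Hypothetical miniature of "uniq -c": collapse adjacent lines that CMP
   reports as equal to the first line of the current run.  Run lengths
   are written to COUNTS (room for N entries is required); the return
   value is the number of output lines.  */
size_t
count_runs (const char *const *lines, size_t n, line_cmp cmp, size_t *counts)
{
  size_t groups = 0;
  size_t first = 0;             /* index of the first line of the run */

  for (size_t i = 0; i < n; i++)
    {
      if (i > 0 && cmp (lines[first], lines[i]) == 0)
        counts[groups - 1]++;   /* duplicate: bump the current count */
      else
        {
          first = i;            /* start a new run */
          counts[groups++] = 1;
        }
    }
  return groups;
}
```

Passing strcmp as the callback keeps the two lines from the report apart, while a callback that returns 0 for them folds both into a single line with a count of 2, which matches the output in the report.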

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Tue, 17 Dec 2019 00:47:02 GMT)

Message #11 received at 38627 <at> debbugs.gnu.org:

From: Roy Smith <roy <at> panix.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Mon, 16 Dec 2019 19:46:39 -0500

Yup, this does depend on the locale.  In my original example, I had LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
>       1 "ⁿᵘˡˡ"
>       1 "ܥܝܪܐܩ"


But, that doesn't fully explain what's going on.  I find it difficult to believe that there's any collation sequence in the world where those two strings should compare the same.  I've been playing around with the ICU string compare demo <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't reproduce this there.  Possibly I just haven't hit upon the right combination of options to set, but I think it's far-fetched that there's any such combination for which those two strings comparing equal is legitimate.

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Tue, 17 Dec 2019 17:26:01 GMT)

Message #14 received at 38627 <at> debbugs.gnu.org:

From: Roy Smith <roy <at> panix.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 12:25:54 -0500

I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:

different()
xmemcoll()
memcoll()
strcoll()

so I tried a little test at the strcoll() level:

#include <stdio.h>
#include <unistd.h>
#include <string.h>

int
main (int argc, char **argv)
{
  unsigned char null[] = {

    0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
  };
  unsigned char iraq[] = {
    0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};

  printf("%s\n", null);
  printf("%s\n", iraq);

  int m = strcoll(null, iraq);
  printf("m = %d\n", m);
}

That correctly says the strings are different:

$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6






> On Dec 16, 2019, at 7:46 PM, Roy Smith <roy <at> panix.com> wrote:
> 
> Yup, this does depend on the locale.  In my original example, I had LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:
> 
>> $ LANG=C.UTF-8 uniq -c x
>>       1 "ⁿᵘˡˡ"
>>       1 "ܥܝܪܐܩ"
> 
> 
> But, that doesn't fully explain what's going on.  I find it difficult to believe that there's any collation sequence in the world where those two strings should compare the same.  I've been playing around with the ICU string compare demo <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't reproduce this there.  Possibly I just haven't hit upon the right combination of options to set, but I think it's far-fetched that there's any such combination for which those two strings comparing equal is legitimate.
>

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Tue, 17 Dec 2019 23:11:02 GMT)

Message #17 received at 38627 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Roy Smith <roy <at> panix.com>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 15:10:33 -0800

On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >>       2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering <jim <at> meyering.net>
> Date:   Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
> of plain 'uniq' in your shell script.

Thanks for the report, Roy, and thanks Paul for diving in.
I confess I haven't done more than look at that old diff, but this
sure sounds like a bug we must fix, to get in line with the much
more recent POSIX spec.

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Wed, 18 Dec 2019 04:40:02 GMT)

Message #20 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: eggert <at> cs.ucla.edu
Cc: bug-coreutils <at> gnu.org, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Wed, 18 Dec 2019 05:39:38 +0100

> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq'

Indeed. The change was done in <http://austingroupbugs.net/view.php?id=963>.
Quote:

"On Page: 3309 Line: 111067 Section: uniq

In the ENVIRONMENT VARIABLES section, delete:

LC_COLLATE

    Determine the locale for ordering rules."

Bruno

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Sun, 23 Feb 2020 19:44:01 GMT)

Notification sent to Roy Smith <roy <at> panix.com>:
bug acknowledged by developer. (Sun, 23 Feb 2020 19:44:01 GMT)

Message #28 received at 38627-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Roy Smith <roy <at> panix.com>
Cc: 38627-done <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Sun, 23 Feb 2020 19:43:27 +0000

On 17/12/2019 17:25, Roy Smith wrote:
> I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:
> 
> different()
> xmemcoll()
> memcoll()
> strcoll()
> 
> so I tried a little test at the strcoll() level:
> 
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> 
> int
> main (int argc, char **argv)
> {
>    unsigned char null[] = {
> 
>      0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
>    };
>    unsigned char iraq[] = {
>      0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};
> 
>    printf("%s\n", null);
>    printf("%s\n", iraq);
> 
>    int m = strcoll(null, iraq);
>    printf("m = %d\n", m);
> }
> 
> That correctly says the strings are different:
> 
> $ LANG=en_US.UTF-8 ./a.out
> ⁿᵘˡˡ
> ܥܝܪܐܩ
> m = 6
> 
> 
> 
> 
> 
> 
>> On Dec 16, 2019, at 7:46 PM, Roy Smith <roy <at> panix.com> wrote:
>>
>> Yup, this does depend on the locale.  In my original example, I had LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:
>>
>>> $ LANG=C.UTF-8 uniq -c x
>>>        1 "ⁿᵘˡˡ"
>>>        1 "ܥܝܪܐܩ"
>>
>>
>> But, that doesn't fully explain what's going on.  I find it difficult to believe that there's any collation sequence in the world where those two strings should compare the same.  I've been playing around with the ICU string compare demo <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't reproduce this there.  Possibly I just haven't hit upon the right combination of options to set, but I think it's far-fetched that there's any such combination for which those two strings comparing equal is legitimate.

I think you ran your test on a newer glibc.
Testing on older glibc-2.22 I see the issue with strcoll() returning 0 for the above strings,
while it returns an expected difference on glibc-2.30 at least.

There are a few things to reason about with removing strcoll(), namely:
  buggy strcoll implementations
  inconsistent Unicode normalization
  mismatched locale settings and data
  handling of characters ignored in collation order

tl;dr is that strcoll() should be removed for all these reasons,
and I've added a test for each of the 4 cases above in the attached patch,
which I'll push later.
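
For illustration only, a byte-wise line check of the kind discussed in this thread might look like the sketch below. This is not the attached patch: the function name lines_differ and the sample strings are assumptions made for the example. It also shows the normalization point from the list above, since a precomposed and a decomposed spelling of the same character are different byte sequences.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* U+00E9 (precomposed e-acute) and U+0065 U+0301 (e followed by a
   combining acute accent), both encoded as UTF-8.  */
static const char e_acute_nfc[] = "\303\251";
static const char e_acute_nfd[] = "e\314\201";

/* Hypothetical byte-wise check: two lines differ unless they have the
   same length and identical bytes.  No locale data is consulted.  */
bool
lines_differ (const char *old_line, size_t oldlen,
              const char *new_line, size_t newlen)
{
  return oldlen != newlen || memcmp (old_line, new_line, oldlen) != 0;
}
```

With this check the two spellings of e-acute are reported as different lines, and so are the two lines from the original report, whatever the collation data in the current locale says.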

Marking this as done.

thanks,
Pádraig

[uniq-no-strcoll.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Sun, 23 Feb 2020 20:03:01 GMT)

Message #31 received at 38627 <at> debbugs.gnu.org:

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: 38627 <at> debbugs.gnu.org
Cc: roy <at> panix.com, P <at> draigBrady.com
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Sun, 23 Feb 2020 21:02:30 +0100

On Feb 23 2020, Pádraig Brady wrote:

> On 17/12/2019 17:25, Roy Smith wrote:
>> I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:
>>
>> different()
>> xmemcoll()
>> memcoll()
>> strcoll()
>>
>> so I tried a little test at the strcoll() level:
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <string.h>
>>
>> int
>> main (int argc, char **argv)
>> {
>>    unsigned char null[] = {
>>
>>      0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
>>    };
>>    unsigned char iraq[] = {
>>      0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};
>>
>>    printf("%s\n", null);
>>    printf("%s\n", iraq);
>>
>>    int m = strcoll(null, iraq);
>>    printf("m = %d\n", m);
>> }

This lacks a call to setlocale, so strcoll runs in the default "C" locale regardless of LANG.
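
A variant of the test with the locale initialized could look roughly like this. It is a sketch rather than a definitive test: the function name collate_in_locale is hypothetical, and the result for a locale such as en_US.UTF-8 depends on the installed locale data and the libc version.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Same byte sequences as in the test program quoted above.  */
static const char null_str[] = "\342\201\277\341\265\230\313\241\313\241";
static const char iraq_str[] = "\334\245\334\235\334\252\334\220\334\251";

/* Hypothetical driver: select LOCALE_NAME, then call strcoll().
   Pass "" to adopt the locale from the environment (LANG, LC_ALL, ...).
   Returns the strcoll() result.  */
int
collate_in_locale (const char *locale_name)
{
  /* A C program starts in the "C" locale; setlocale() is what makes
     LANG and friends take effect.  On failure the locale is unchanged.  */
  if (!setlocale (LC_ALL, locale_name))
    fprintf (stderr, "warning: locale \"%s\" not available\n", locale_name);

  int m = strcoll (null_str, iraq_str);
  printf ("LC_COLLATE=%s m = %d\n", setlocale (LC_COLLATE, NULL), m);
  return m;
}
```

Calling collate_in_locale ("") from main and running the program with LANG=en_US.UTF-8 exercises the locale's collation rules, whereas the quoted program compares in the "C" locale no matter what LANG is set to.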

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-coreutils <at> gnu.org:
bug#38627; Package coreutils. (Sun, 23 Feb 2020 23:50:01 GMT)

Message #34 received at 38627 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: P <at> draigBrady.com
Cc: roy <at> panix.com, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Sun, 23 Feb 2020 15:49:46 -0800

On 2/23/20 11:43 AM, Pádraig Brady wrote:

>  #include "hard-locale.h"
>  #include "posixver.h"
>  #include "stdio--.h"
> -#include "xmemcoll.h"

Please also remove the '#include "hard-locale.h"' line.

Thanks for fixing this.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 23 Mar 2020 11:24:04 GMT)