GNU bug report logs - #11967
Bug in "uniq"

Package: coreutils

Reported by: Jaime Gaspar <mail <at> jaimegaspar.com>

Date: Tue, 17 Jul 2012 21:30:02 UTC

Severity: normal

Tags: notabug

Merged with 11968

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 11967 in the body.
You can then email your comments to 11967 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#11967; Package coreutils. (Tue, 17 Jul 2012 21:30:02 GMT)

Acknowledgement sent to Jaime Gaspar <mail <at> jaimegaspar.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 17 Jul 2012 21:30:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jaime Gaspar <mail <at> jaimegaspar.com>
To: bug-coreutils <at> gnu.org
Subject: Bug in "uniq"
Date: Tue, 17 Jul 2012 10:17:43 -0800

Dear Sir or Madam,

I think that there is a bug in "uniq" (version 8.13).

The file "bug.txt" attached consists of two lines:
- the first one containing a character that
  looks like a "v" and a line break;
- the second one containing a character that
  looks like an upside-down "v" and a line break.
In hex:

    E2 88 A8  0A
    E2 88 A7  0A

When we run "uniq bug.txt" in a terminal, "uniq" outputs a single line, so "uniq" thinks that the two lines are equal, but they are not.

Regards,
Jaime Gaspar
_____________________________
Homepage: www.jaimegaspar.com
E-mail: mail <at> jaimegaspar.com

Forcibly Merged 11967 11968. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 17 Jul 2012 21:56:01 GMT)

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 17 Jul 2012 21:56:01 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Tue, 17 Jul 2012 21:56:02 GMT)

Notification sent to Jaime Gaspar <mail <at> jaimegaspar.com>:
bug acknowledged by developer. (Tue, 17 Jul 2012 21:56:02 GMT)

Message #14 received at 11967-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Jaime Gaspar <mail <at> jaimegaspar.com>
Cc: control <at> debbugs.gnu.org, 11967-done <at> debbugs.gnu.org
Subject: Re: bug#11967: Bug in "uniq"
Date: Tue, 17 Jul 2012 15:49:38 -0600

forcemerge 11967 11968
tag 11967 notabug
thanks

On 07/17/2012 12:17 PM, Jaime Gaspar wrote:
> I think that there is a bug in "uniq" (version 8.13).

Is this your distro's build?  Either way, I reproduced your result with
the latest coreutils.git (post-8.17), so this is not likely to be a bug
in a distro-specific multibyte patch.

> 
> The file "bug.txt" attached consists of two lines:
> - the first one containing a character that
>   looks like a "v" and a line break;
> - the second one containing a character that
>   looks like an upside-down "v" and a line break.
> In hex:
> 
>     E2 88 A8  0A
>     E2 88 A7  0A

Those glyphs that you describe line up with Unicode characters.  I bet
you are using a locale with UTF-8 character encoding.
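
For readers without the attachment: the bytes E2 88 A8 and E2 88 A7 are the UTF-8 encodings of U+2228 (LOGICAL OR) and U+2227 (LOGICAL AND). The following is a minimal sketch, not part of the original exchange, that recreates the file from the hex dump and prints each line's bytes. It assumes a POSIX shell with printf, od, sed, tr and mktemp; the helper name make_bug_file and the temporary directory are illustrative.

```shell
# Recreate the two-line file from the hex dump in the report.
# Octal escapes are used because \x escapes are not portable in printf:
#   E2 88 A8 = \342\210\250    E2 88 A7 = \342\210\247
# make_bug_file is a hypothetical helper name, not a coreutils command.
make_bug_file() {
    printf '\342\210\250\n\342\210\247\n' > "$1"
}

workdir=$(mktemp -d)
bug_file="$workdir/bug.txt"
make_bug_file "$bug_file"

# Dump the raw bytes of the whole file; compare with the hex in the report.
od -An -tx1 "$bug_file"

# Extract the bytes of each line as one hex string (spaces removed).
first=$(sed -n 1p "$bug_file" | od -An -tx1 | tr -d ' \n')
second=$(sed -n 2p "$bug_file" | od -An -tx1 | tr -d ' \n')
echo "line 1: $first"
echo "line 2: $second"
```

Byte for byte, the two lines differ only in the last byte before the newline (A8 versus A7), so any byte-wise comparison treats them as distinct.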

> 
> When we run "uniq bug.txt" in a terminal, "uniq" outputs a single line, so "uniq" thinks that the two lines are equal, but they are not.

I can reproduce your symptoms, but only when I fudge my locale:

$ LC_ALL=C uniq ../bug.txt
∨
∧
$ LC_ALL=en_US.UTF-8 uniq ../bug.txt
∨
$

Remember, 'uniq' is required by POSIX to use the same line comparison
techniques as 'sort'; and 'sort' is required to use strcoll() (not
strcmp) to compare lines.  And in your particular choice of locale,
strcoll() happens to state that '∨' and '∧' collate identically; hence
uniq is correct in stating that you have a duplicated line according to
your current locale.

$ LC_ALL=en_US.UTF-8 sort ../bug.txt -u --debug
sort: using ‘en_US.UTF-8’ sorting rules
∨
_
$
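
The locale dependence shown in the transcripts above can be checked with a small script. This is a sketch under stated assumptions, not part of the original reply: count_unique is a hypothetical helper, and the en_US.UTF-8 branch runs only if that locale is installed. Only the C-locale result is fixed; the count under a UTF-8 locale depends on the collation tables shipped with the C library, so it may differ from the 2012 output quoted here.

```shell
# Count the lines uniq emits for a file under a given locale.
# count_unique is a hypothetical helper, not part of coreutils.
count_unique() {
    LC_ALL=$1 uniq "$2" | wc -l | tr -d ' '
}

# Same two lines as in the report: E2 88 A8 0A and E2 88 A7 0A.
f=$(mktemp)
printf '\342\210\250\n\342\210\247\n' > "$f"

# In the C locale, lines are compared byte by byte, so both survive.
c_count=$(count_unique C "$f")
echo "C locale: $c_count line(s)"

# Try the UTF-8 locale only if it is installed. The result here depends
# on the C library's collation data, so no particular count is assumed.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
    echo "en_US.UTF-8: $(count_unique en_US.UTF-8 "$f") line(s)"
fi

rm -f "$f"
```

When byte-wise uniqueness is what is wanted, running the command with LC_ALL=C, as in the first transcript of this reply, avoids the collation rules of the user's locale.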

So I'm closing this as not a bug, along with a final pointer to our FAQ:

https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



Notification sent to Jaime Gaspar <mail <at> jaimegaspar.com>:
bug acknowledged by developer. (Tue, 17 Jul 2012 21:56:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 15 Aug 2012 11:24:05 GMT)


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.