GNU bug report logs - #26362
multibyte: tr: "tr -cd" -- Problem with UTF-8?

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Severity: wishlist; Reported by: Ronald Schaten <ronald@HIDDEN>; Keywords: notabug; dated Tue, 4 Apr 2017 15:25:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: tr: "tr -cd" -- Problem with UTF-8?' from 'tr -cd -- Problem with UTF-8?' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Added tag(s) notabug. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 26362 <at> debbugs.gnu.org:


Subject: Re: bug#26362: tr -cd -- Problem with UTF-8?
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\))
Content-Type: text/plain; charset=iso-8859-1
From: Assaf Gordon <assafgordon@HIDDEN>
In-Reply-To: <20170404140150.GV3709@HIDDEN>
Date: Tue, 4 Apr 2017 22:19:15 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <50AD3375-F204-4F23-A6EB-6BD3F3A79D4E@HIDDEN>
References: <20170404140150.GV3709@HIDDEN>
To: Ronald Schaten <ronald@HIDDEN>
Cc: 26362 <at> debbugs.gnu.org

tags 26362 notabug wishlist
stop 26362

Hello,

> On Apr 4, 2017, at 10:01, Ronald Schaten <ronald@HIDDEN> wrote:
>
> I'm not sure if this is a bug or if I'm using it wrong.

Neither: GNU tr simply does not yet support multibyte characters.

> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
>
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3                                       .
>
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.

What happened here is this:
'tr' currently reads its string argument (SET1) as single bytes, so it
treats "ä" as two separate octets: \xC3 \xA4 (the UTF-8 encoding
of small a with umlaut).
It then processes the input octet by octet. With -c -d, every octet that is
not in SET1 is deleted: \xC3 is in the set and is kept, \x82 is not and is
deleted.
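
The byte-wise behaviour described above can be reproduced without depending on the terminal's encoding. This is only a sketch: keep_bytes is a hypothetical helper name, the bytes are written as octal escapes so the script itself stays ASCII, and LC_ALL=C is set merely to make the single-byte interpretation explicit.

```shell
# Sketch only: keep_bytes is a hypothetical helper, not part of coreutils.
keep_bytes() {
    # $1: SET1 as raw bytes; stdin: input bytes.
    # Deletes every byte not in SET1 and prints the survivors as hex.
    LC_ALL=C tr -cd "$1" | od -An -tx1 | tr -d ' \n'
}

set1=$(printf '\303\244')      # UTF-8 bytes of "ä": c3 a4

# Input is the UTF-8 encoding of "Â" (c3 82). The c3 byte is in SET1, 82 is not.
printf '\303\202' | keep_bytes "$set1"; echo    # prints: c3

# With SET1 = "a" (61), neither input byte is in the set, so nothing survives.
printf '\303\202' | keep_bytes a; echo          # prints an empty line
```

The first call shows why a lone c3 comes out: "ä" and "Â" happen to share their first UTF-8 byte.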

> When I try without the umlaut it gives me the empty result, as expected:
>
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd

Indeed, because here you're asking to
keep only octets whose value is \x61 (the ASCII value of 'a'):
neither "\xC3" nor "\x82" matches, so both are deleted.


> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!

Not a bug - but a yet-missing feature.
For relevant discussion see here:
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8

As a temporary workaround, you can use GNU sed, which is multibyte-aware:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
  ä
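
The sed workaround only treats "ä" as one character when sed runs in a UTF-8 locale. The sketch below makes that assumption explicit by selecting an installed UTF-8 locale first. pick_utf8_locale and keep_char are hypothetical helper names, and the behaviour is assumed to be that of GNU sed; other sed implementations may differ.

```shell
# Sketch only: pick_utf8_locale and keep_char are hypothetical helper names.
pick_utf8_locale() {
    # Print the first installed locale whose name ends in UTF-8 or utf8.
    locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1
}

keep_char() {
    # $1: the single character to keep; stdin: text to filter.
    # Returns 1 without output if no UTF-8 locale is installed.
    utf8_loc=$(pick_utf8_locale)
    [ -n "$utf8_loc" ] || return 1
    LC_ALL=$utf8_loc sed "s/[^$1]//g"
}

a_umlaut=$(printf '\303\244')    # UTF-8 bytes of "ä"

# Same input as in the example above: "abc äÂ def".
printf 'abc \303\244\303\202 def\n' | keep_char "$a_umlaut" ||
    echo "skipped: no UTF-8 locale installed" >&2
```

Setting LC_ALL only for the sed invocation keeps the rest of the script independent of the caller's locale.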

'sed' also supports "character equivalence classes".
In the following example, all characters except those that are equivalent to 'a'
will be deleted:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
  aäÂ

Character equivalence classes will also work with a future 'tr'
once multibyte support is added.

Lastly,
"echo -en" is not portable. It is recommended to use "printf" instead.
"printf" has the added advantage that it supports Unicode code points
directly, so you do not have to know the UTF-8 encoding of a Unicode character,
e.g.:
     printf "\u00c2\n"
will print capital A with circumflex (and will work in other locales if they
support this character, not just UTF-8).
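
To relate the code point form to the \xc3\x82 bytes used earlier in this thread, the following sketch prints the bytes that printf emits for a code point in a UTF-8 locale. It assumes the printf from GNU coreutils (the \u escape is an extension and may be missing elsewhere) and an installed UTF-8 locale; utf8_bytes_of is a hypothetical helper name.

```shell
# Sketch only: utf8_bytes_of is a hypothetical helper name.
utf8_bytes_of() {
    # $1: four hex digits of a Unicode code point, e.g. 00c2.
    # Prints the UTF-8 bytes as hex, or returns 1 if no UTF-8 locale exists.
    utf8_loc=$(locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1)
    [ -n "$utf8_loc" ] || return 1
    # "env printf" runs the external printf rather than a shell builtin.
    LC_ALL=$utf8_loc env printf "\\u$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# U+00C2 is capital A with circumflex; in UTF-8 it is the two bytes c3 82.
utf8_bytes_of 00c2 || echo "skipped: no UTF-8 locale installed" >&2
```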


I'm thus marking this item as "wishlist" and "notabug",
but I'll keep it open until it is implemented.
Discussion can continue by replying to this thread.

regards,
 - assaf





Information forwarded to bug-coreutils@HIDDEN:
bug#26362; Package coreutils. Full text available.

Message received at submit <at> debbugs.gnu.org:


Date: Tue, 4 Apr 2017 16:01:52 +0200
From: Ronald Schaten <ronald@HIDDEN>
To: bug-coreutils@HIDDEN
Subject: tr -cd -- Problem with UTF-8?
Message-ID: <20170404140150.GV3709@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Content-Transfer-Encoding: quoted-printable

Hey...

I'm not sure if this is a bug or if I'm using it wrong. As a matter of
fact, I tested this on several systems, and on BSD-based systems (Mac)
the tr tool gives different results -- the one I expected.

The simplest way to reproduce this looks like this (sorry, umlaut
ahead):

$ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
% 00000000: c3                                       .

The echo prints a capital A with a circumflex (Â), and I expect the tr
command to delete everything except the small umlaut ä. It looks as if
tr just deletes the second byte.

When I try without the umlaut it gives me the empty result, as expected:

$ echo -ne "\xc3\x82" | tr -cd "a" | xxd
[empty result]

I tested several systems, the oldest is a Debian with coreutils 8.5, the
newest an Ubuntu with coreutils 8.25.


For the moment, I'll try to solve my problem differently, but... is this
a bug? Thanks in advance!


Regards,
Ronald.

-- 
There is no reason for any individual to have a computer in his home.
(Ken Olsen, DEC)




Acknowledgement sent to Ronald Schaten <ronald@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to bug-coreutils@HIDDEN:
bug#26362; Package coreutils. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.