GNU logs - #26362, boring messages


Message sent to bug-coreutils@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#26362: tr -cd -- Problem with UTF-8?
Resent-From: Ronald Schaten <ronald@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Tue, 04 Apr 2017 15:25:02 +0000
Resent-Message-ID: <handler.26362.B.14913194942290 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: report 26362
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: 
To: 26362 <at> debbugs.gnu.org
X-Debbugs-Original-To: bug-coreutils@HIDDEN
Received: via spool by submit <at> debbugs.gnu.org id=B.14913194942290
          (code B ref -1); Tue, 04 Apr 2017 15:25:02 +0000
Received: (at submit) by debbugs.gnu.org; 4 Apr 2017 15:24:54 +0000
Received: from localhost ([127.0.0.1]:60828 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1cvQKE-0000ar-4L
	for submit <at> debbugs.gnu.org; Tue, 04 Apr 2017 11:24:54 -0400
Received: from eggs.gnu.org ([208.118.235.92]:47120)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ronald@HIDDEN>) id 1cvP2G-0005EP-6h
 for submit <at> debbugs.gnu.org; Tue, 04 Apr 2017 10:02:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <ronald@HIDDEN>) id 1cvP25-0007t8-Kg
 for submit <at> debbugs.gnu.org; Tue, 04 Apr 2017 10:02:11 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:47829)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <ronald@HIDDEN>)
 id 1cvP25-0007t4-Hi
 for submit <at> debbugs.gnu.org; Tue, 04 Apr 2017 10:02:05 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:38087)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <ronald@HIDDEN>) id 1cvP20-0000rU-KZ
 for bug-coreutils@HIDDEN; Tue, 04 Apr 2017 10:02:05 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <ronald@HIDDEN>) id 1cvP1x-0007qG-JR
 for bug-coreutils@HIDDEN; Tue, 04 Apr 2017 10:02:00 -0400
Received: from mail.scheunentor.de ([148.251.13.145]:59619
 helo=ispmail01.scheunentor.de)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <ronald@HIDDEN>)
 id 1cvP1x-0007ov-D4
 for bug-coreutils@HIDDEN; Tue, 04 Apr 2017 10:01:57 -0400
Received: from localhost (localhost [127.0.0.1])
 by ispmail01.scheunentor.de (Postfix) with ESMTP id 0F9861F579
 for <bug-coreutils@HIDDEN>; Tue,  4 Apr 2017 16:01:53 +0200 (CEST)
X-Virus-Scanned: Debian amavisd-new at ispmail01.scheunentor.de
Received: from ispmail01.scheunentor.de ([127.0.0.1])
 by localhost (ispmail01.intra.scheunentor.de [127.0.0.1]) (amavisd-new,
 port 10024) with ESMTP id mRtWWiDgxSao for <bug-coreutils@HIDDEN>;
 Tue,  4 Apr 2017 16:01:50 +0200 (CEST)
Received: from shell.intra.scheunentor.de (shell.intra.scheunentor.de
 [192.168.0.206])
 by ispmail01.scheunentor.de (Postfix) with SMTP id AB1051F548
 for <bug-coreutils@HIDDEN>; Tue,  4 Apr 2017 16:01:50 +0200 (CEST)
Received: (nullmailer pid 27293 invoked by uid 1000);
 Tue, 04 Apr 2017 14:01:52 -0000
Date: Tue, 4 Apr 2017 16:01:52 +0200
From: Ronald Schaten <ronald@HIDDEN>
Message-ID: <20170404140150.GV3709@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Content-Transfer-Encoding: quoted-printable
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy]
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -5.0 (-----)
X-Mailman-Approved-At: Tue, 04 Apr 2017 11:24:51 -0400
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hey...

I'm not sure if this is a bug or if I'm using it wrong. As a matter of
fact, I tested this on several systems, and on BSD-based systems (Mac)
the tr tool gives different results -- the ones I expected.

The simplest way to reproduce this looks like this (sorry, umlaut
ahead):

$ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
% 00000000: c3                                       .

The echo prints a capital A with a circumflex (Â), and I expect the tr
command to delete everything except the small umlaut ä. It looks as if
tr just deletes the second byte.

When I try without the umlaut it gives me the empty result, as expected:

$ echo -ne "\xc3\x82" | tr -cd "a" | xxd
[empty result]

I tested several systems, the oldest is a Debian with coreutils 8.5, the
newest an Ubuntu with coreutils 8.25.


For the moment, I'll try to solve my problem differently, but... is this
a bug? Thanks in advance!


Regards,
Ronald.

-- 
There is no reason for any individual to have a computer in his home.
(Ken Olsen, DEC)




Message sent:


Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
Content-Type: text/plain; charset=utf-8
X-Loop: help-debbugs@HIDDEN
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: Ronald Schaten <ronald@HIDDEN>
Subject: bug#26362: Acknowledgement (tr -cd -- Problem with UTF-8?)
Message-ID: <handler.26362.B.14913194942290.ack <at> debbugs.gnu.org>
References: <20170404140150.GV3709@HIDDEN>
X-Gnu-PR-Message: ack 26362
X-Gnu-PR-Package: coreutils
Reply-To: 26362 <at> debbugs.gnu.org
Date: Tue, 04 Apr 2017 15:25:02 +0000

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-coreutils@HIDDEN

If you wish to submit further information on this problem, please
send it to 26362 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

-- 
26362: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=26362
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems


Message sent to bug-coreutils@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#26362: tr -cd -- Problem with UTF-8?
Resent-From: Assaf Gordon <assafgordon@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-coreutils@HIDDEN
Resent-Date: Wed, 05 Apr 2017 02:20:01 +0000
Resent-Message-ID: <handler.26362.B26362.14913587655684 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 26362
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: 
To: Ronald Schaten <ronald@HIDDEN>
Cc: 26362 <at> debbugs.gnu.org
Received: via spool by 26362-submit <at> debbugs.gnu.org id=B26362.14913587655684
          (code B ref 26362); Wed, 05 Apr 2017 02:20:01 +0000
Received: (at 26362) by debbugs.gnu.org; 5 Apr 2017 02:19:25 +0000
Received: from localhost ([127.0.0.1]:33017 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1cvaXd-0001Ta-3t
	for submit <at> debbugs.gnu.org; Tue, 04 Apr 2017 22:19:25 -0400
Received: from mail-qk0-f173.google.com ([209.85.220.173]:36105)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <assafgordon@HIDDEN>)
 id 1cvaXb-0001TD-IX; Tue, 04 Apr 2017 22:19:24 -0400
Received: by mail-qk0-f173.google.com with SMTP id p22so55117qka.3;
 Tue, 04 Apr 2017 19:19:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=subject:mime-version:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=74jdUgyxWSqjReugYzZWccQe7PW3zsXmh7CvdZDSt3A=;
 b=dP92BkvsTBMJ59+xafaLIyyxvW6wsIHVbplLzaqGPXgDZSsd4nP1j8Td7GZu0GGgBK
 EN3GmXtXmVJctfUC/kLsJpEElOlLySsotdhg1GVDicyeAGTeMN6pGjoRlfOeq3u8en5N
 /baSGNlwD0Wkb7Um1ex2MBguN3lRn6GM7x23zGblyroNESYWDtFVCcrwShor4Ol7GjA4
 1wtA1rZvuepSVlooD6eeCpFz6qssEyWlXGxJRMnUy6b4nVvDHYcW2vjc7yFtOeJloLuu
 NAYOzmpWV5kHiexgGT4/4pWxrlAnY22BNCJargycD/s4aJpTUHXOmuev9POVnZSPeuiD
 kXMQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:mime-version:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=74jdUgyxWSqjReugYzZWccQe7PW3zsXmh7CvdZDSt3A=;
 b=XIT/biRIYcwzwmvqXrgTz6cjArzS06qwGvq17vwwOU7vAImntRAW14DR2ZBqYwIxto
 1WjJOJjSlyoZVHCInH93ptOnKqUMzc/lhkSloHbaY9ISCCWFBgO6sI2kfdrDaoduzmAX
 98RaKi9S1QqkbSAnHPpXFf2vRrJuk6dmG1ar0HBqOOWhnWU+z9Upd289DP5MRi5yexGq
 BkBfK82FIwAihogA1oplIFBJmBN+C7z3qfdD/zDggUFPVFjuHYc4/hGPuYOrQff9d+5C
 pScVmNCfAjBEkqei8v9IODfPemk9A2YnfcLH4w40CrOJZG+TfW928dgCutzPyGICAklA
 ZxmA==
X-Gm-Message-State: AN3rC/7/cTqfWJlwO1UKKZ5ZT1QGXCaoGMoY1Pa94+dvQZmf5AJEaVKZmhkaIXlEs8InhA==
X-Received: by 10.55.102.193 with SMTP id a184mr315518qkc.309.1491358758055;
 Tue, 04 Apr 2017 19:19:18 -0700 (PDT)
Received: from ix.home (pool-100-37-92-116.nycmny.fios.verizon.net.
 [100.37.92.116])
 by smtp.gmail.com with ESMTPSA id p19sm13168506qtp.36.2017.04.04.19.19.16
 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
 Tue, 04 Apr 2017 19:19:17 -0700 (PDT)
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\))
Content-Type: text/plain; charset=iso-8859-1
From: Assaf Gordon <assafgordon@HIDDEN>
In-Reply-To: <20170404140150.GV3709@HIDDEN>
Date: Tue, 4 Apr 2017 22:19:15 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <50AD3375-F204-4F23-A6EB-6BD3F3A79D4E@HIDDEN>
References: <20170404140150.GV3709@HIDDEN>
X-Mailer: Apple Mail (2.2102)
X-Spam-Score: 0.5 (/)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.5 (/)

tags 26362 notabug wishlist
stop 26362

Hello,

> On Apr 4, 2017, at 10:01, Ronald Schaten <ronald@HIDDEN> wrote:
>
> I'm not sure if this is a bug or if I'm using it wrong.

Neither - it is simply that GNU tr does not yet support multibyte
characters.

> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
>
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3                                       .
>
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.

What happened here is this:
'tr' currently reads its set argument (SET1) byte by byte, so it treats
the "ä" as two separate octets, \xC3 and \xA4 (the UTF-8 encoding of
small a with umlaut).
With -c and -d, it then reads the input octet by octet, keeps \xC3
(which is in the set) and deletes \x82 (which is not).
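
The byte-wise behaviour described above can be reproduced without typing
any non-ASCII character. This is only a minimal sketch: the function name
keep_set_bytes is made up for illustration, SET1 is spelled as the octal
escapes of the two UTF-8 octets of "ä", and LC_ALL=C is set merely to
make the byte-wise handling explicit.

```shell
# Hypothetical helper: delete every byte that is not \xC3 or \xA4,
# i.e. what 'tr -cd "ä"' amounts to when SET1 is read byte by byte.
keep_set_bytes() {
  LC_ALL=C tr -cd '\303\244'
}

# Input is the UTF-8 encoding of "Â" (\xC3 \x82), written in octal.
# od prints the surviving bytes in hex.
printf '\303\202' | keep_set_bytes | od -An -tx1
```

Only the c3 octet should survive, because \x82 is in neither half of the
set.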

> When I try without the umlaut it gives me the empty result, as expected:
>
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd

Indeed, because here you're asking to
keep only octets whose value is \x61 (the ASCII value of 'a') -
neither "\xC3" nor "\x82" matches, and so both are deleted.


> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!

Not a bug - but a yet-missing feature.
For relevant discussion see here:
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8

As a temporary work-around, you can use GNU sed, which is
multibyte-aware:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
  ä

'sed' also supports "character equivalence classes".
In the following example, all characters except those that are
equivalent to 'a' will be deleted:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
  aäÂ

Character equivalence classes will work with a future 'tr' as well,
once multibyte support is added.
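
Another possible work-around, not mentioned in this thread and offered
only as a sketch: when every character of interest exists in a
single-byte encoding such as ISO-8859-1, the data can be converted with
iconv, filtered with the byte-wise 'tr', and converted back. The function
name keep_a_umlaut is hypothetical, and iconv is expected to fail on
input characters that have no ISO-8859-1 representation.

```shell
# Hypothetical helper: keep only "ä" by filtering in ISO-8859-1,
# where "ä" is the single byte \xE4 (octal 344).
keep_a_umlaut() {
  iconv -f UTF-8 -t ISO-8859-1 |
    LC_ALL=C tr -cd '\344' |
    iconv -f ISO-8859-1 -t UTF-8
}

# Same input as the sed example: "abc äÂ def" followed by a newline,
# with the non-ASCII characters written as octal escapes.
printf 'abc \303\244\303\202 def\n' | keep_a_umlaut | od -An -tx1
```

Note that the trailing newline is deleted as well, since it is not in the
set either; the hex dump should show only the two octets of "ä".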

Lastly,
"echo -en" is not portable. It is recommended to use "printf" instead.
"printf" has the added advantage that it supports Unicode code points
directly, so you do not have to know the UTF-8 encoding of a Unicode
character, e.g.:
     printf "\u00c2\n"
will print capital A with circumflex (and will work in other locales if
they support this character, not just UTF-8).
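
As a small illustration of the portability point: POSIX specifies octal
escapes (\ddd) for the printf format string but not \x or \u, so when
the exact bytes matter and the script may run under a different shell,
spelling the octets in octal is the most conservative choice. The
function name print_a_circumflex is hypothetical.

```shell
# Hypothetical helper: emit the UTF-8 encoding of "Â" (U+00C2)
# using only octal escapes, which POSIX printf defines.
print_a_circumflex() {
  printf '\303\202'
}

# Dump the bytes in hex to confirm what was written.
print_a_circumflex | od -An -tx1
```

The trade-off is that the octal form hard-codes UTF-8, whereas the \u
form, where supported, follows the current locale's encoding.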


I'm thus marking this item as "wishlist" and "notabug",
but I'll keep it open until it is implemented.
Discussion can continue by replying to this thread.

regards,
 - assaf





Message received at control <at> debbugs.gnu.org:


Received: (at control) by debbugs.gnu.org; 29 Oct 2018 03:04:04 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Oct 28 23:04:04 2018
Received: from localhost ([127.0.0.1]:49681 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1gGxqW-00053j-LB
	for submit <at> debbugs.gnu.org; Sun, 28 Oct 2018 23:04:04 -0400
Received: from mail-pg1-f178.google.com ([209.85.215.178]:38421)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <assafgordon@HIDDEN>) id 1gGxqV-00052b-08
 for control <at> debbugs.gnu.org; Sun, 28 Oct 2018 23:04:03 -0400
Received: by mail-pg1-f178.google.com with SMTP id f8-v6so3159499pgq.5
 for <control <at> debbugs.gnu.org>; Sun, 28 Oct 2018 20:04:02 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=to:from:message-id:date:user-agent:mime-version:content-language
 :content-transfer-encoding;
 bh=fH6hAKZNEGf1YGpGk9nb5QX4JVfsrwpOAZT2F7WSURI=;
 b=FTpqkPUvrZdkaqNFKBHYQOcZc3ZiSwwQ4V/BdYvlA9IqlNwW6gNPbF76di+Gu0mVgX
 cf67Be7GNniA2VI7qofO+1HP09fnJ1Q2hK3TVn27K78I+C/zNyXLIwhAKP963LQHgUoJ
 9XMIY8cOXV5TXYEQ6RT0nD8UNZ9w5ZTh/2Obrclqp0efXwXpbAlCpgqiMbv6MHIcDll/
 YULyne2atUGo9oHE4Acz1gZUHjnQIkImsLuqK+PY54PNrNB8tSU3VSPbG5VFK3HMZiiy
 q67lveiIC9Sco/k8gD0cyx+0T7oKK22tyG7J9n+c3ySdchF38NHWHCG5je+JO4vdNYJ2
 TC2g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:to:from:message-id:date:user-agent:mime-version
 :content-language:content-transfer-encoding;
 bh=fH6hAKZNEGf1YGpGk9nb5QX4JVfsrwpOAZT2F7WSURI=;
 b=VJRkxxWEtm146tkLieQfU6nMnpqM9DMYl8dOiXGn3/UvqEI2AJ3ilFaAPV3I3sR7kL
 A7pyFoEouDgQ2Xl7OHKmqSF+fuc0baHxB9FZzG8YmUAD8jmpKRh6hGMC1mn/PQoppbBo
 Q5VSacSj3vN4vcabDLTLudcVpk1JtP76Kb5B+9OGw5OpXMKwN/tYaBbEEhADv1JGmvyp
 XqlNinX5LyWzSlyNycQq0J4JPElxScIKD00+mE6vgNyYgtyy2GBlUDyWV3beFHHz96nC
 5GUHFCzAqJCkWehk+p4N0BRKQ6xjebsEGZuGQL8yOv+tjDnpEPoKa17BBgPkuyHdSp56
 1+og==
X-Gm-Message-State: AGRZ1gJ69S43IzxT1v73MlIvAIVsixm4ik4SHM5FvB+sOp1bd3z4adh9
 iD4dBxRNP0pCqW7oAYtF9BGIn4FSIt4=
X-Google-Smtp-Source: AJdET5cksNofN1TXHqTExk/X1hjdK5Po0nfGepurd+u+RqvrZtebyvFDDZMxqBKqnxckHfEhw9B7/g==
X-Received: by 2002:a62:34c5:: with SMTP id
 b188-v6mr13784982pfa.65.1540782236611; 
 Sun, 28 Oct 2018 20:03:56 -0700 (PDT)
Received: from tomato.housegordon.com (moose.housegordon.com. [184.68.105.38])
 by smtp.googlemail.com with ESMTPSA id
 t11-v6sm22307330pgn.38.2018.10.28.20.03.53
 for <control <at> debbugs.gnu.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Sun, 28 Oct 2018 20:03:55 -0700 (PDT)
To: control <at> debbugs.gnu.org
From: Assaf Gordon <assafgordon@HIDDEN>
Message-ID: <d5241edc-7e81-c07d-0c69-5c25e42f83ff@HIDDEN>
Date: Sun, 28 Oct 2018 21:03:52 -0600
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.2.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: 2.0 (++)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has NOT identified this incoming email as spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview: severity 26362 wishlist retitle 26362 multibyte: tr: "tr -cd"
 -- Problem with UTF-8? [...] 
 Content analysis details:   (2.0 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider
 (assafgordon[at]gmail.com)
 -0.0 SPF_PASS               SPF: sender matches SPF record
 -0.0 RCVD_IN_DNSWL_NONE     RBL: Sender listed at http://www.dnswl.org/, no
 trust [209.85.215.178 listed in list.dnswl.org]
 1.8 MISSING_SUBJECT        Missing Subject: header
 0.2 NO_SUBJECT             Extra score for no subject
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 1.0 (+)

severity 26362 wishlist
retitle 26362 multibyte: tr: "tr -cd" -- Problem with UTF-8?




Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.