Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Pádraig Brady <P@HIDDEN> to control <at> debbugs.gnu.org.
Jim Meyering <jim@HIDDEN> to control <at> debbugs.gnu.org.
Received: (at submit) by debbugs.gnu.org; 27 Feb 2012 06:15:17 +0000
From: Chris Jones <cjns1989@HIDDEN>
To: bug-coreutils@HIDDEN
Date: Mon, 27 Feb 2012 00:44:56 -0500
Subject: Re: bug#10880: instead of characters, tr works on bytes
Message-id: <20120227054456.GA3559@HIDDEN>
In-reply-to: <20120224142912.107150@HIDDEN>

On Fri, Feb 24, 2012 at 09:29:12AM EST, Marton Kadar wrote:

[..]

> > $ set | grep ^L
> > LANG=hu_HU.UTF-8
> > LC_ALL=hu_HU.UTF-8
> > LINES=73
> > LOGNAME=kadar1marto518
> >
> > Now let's see the bytestream for the following string
> > (which means flood in Hungarian):
> >
> > $ echo árvíz | od -c
> > 0000000 303 241   r   v 303 255   z  \n
> > 0000010
> >
> > Let us try to delete a character and see if it worked:
> >
> > $ echo árvíz | tr -d á | od -c
> > 0000000   r   v 255   z  \n
> > 0000005

[..]

Try this for size...

$ echo árvíz | od -t x1z -w16
$ echo árvíz | tr -d é | od -t x1z -w16
$ echo árvíz | tr -d é > /tmp/u.txt
$ isutf8 /tmp/u.txt

And there is not even an ‘é’ in ‘árvíz’..

CJ

P.S. Though you do have to look for it a bit, the coreutils manual
clearly states that only single-byte encodings are supported:

http://www.gnu.org/software/coreutils/manual/html_node/tr-invocation.html

--
Mooo Canada!!!!
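The experiment in this message can be reproduced without a shell. What follows is a minimal Python sketch, not coreutils code: `delete_bytes` and `is_utf8` are hypothetical helper names, and `bytes.translate` stands in for the byte-at-a-time deletion that `tr -d` performs. It illustrates why deleting ‘é’ damages a string that contains no ‘é’: in UTF-8, ‘é’, ‘á’ and ‘í’ all begin with the same lead byte 0xC3.

```python
def delete_bytes(data: bytes, unwanted: bytes) -> bytes:
    """Drop every byte of `data` that occurs in `unwanted` (byte-wise, like tr -d)."""
    return data.translate(None, unwanted)


def is_utf8(data: bytes) -> bool:
    """Report whether `data` decodes cleanly as UTF-8 (a stand-in for isutf8)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "árvíz\n".encode("utf-8")       # c3 a1 72 76 c3 ad 7a 0a
e_acute = "é".encode("utf-8")          # c3 a9

# Neither a1 nor ad is deleted, but both c3 lead bytes are.
damaged = delete_bytes(word, e_acute)

print(word.hex(" "))
print(damaged.hex(" "))
print(is_utf8(word), is_utf8(damaged))
```

The damaged result starts with a continuation byte (0xA1) that has lost its lead byte, which is why a UTF-8 validator rejects it.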
bug-coreutils@HIDDEN: bug#10880; Package coreutils.
Received: (at 10880) by debbugs.gnu.org; 25 Feb 2012 23:23:29 +0000
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
To: Marton Kadar <marton.kadar@HIDDEN>
Cc: 10880 <at> debbugs.gnu.org
Date: Sat, 25 Feb 2012 15:20:44 -0800
Subject: Re: bug#10880: instead of characters, tr works on bytes
Message-ID: <4F496CCC.5010408@HIDDEN>
In-Reply-To: <20120225220727.107140@HIDDEN>

On 02/25/2012 02:07 PM, Marton Kadar wrote:
> the execution path (single byte specific or generalized
> multibyte capable) can be determined at program startup, so in the
> worst case there can be a tr and a tr-slow-but-multibyte version,
> former calling the latter when so directed by the locale settings.

Something like that should work, yes. Unfortunately so far nobody
has volunteered to do it. The task would not be trivial. We don't
want to maintain two copies of the code, one for single-byte and
one for multibyte, as that'd be a maintenance problem. Instead,
we'd like to have just one copy of the code, which is easy to read
and which compiles into either unibyte or multibyte versions.

> avoiding a solely performance related penalty in text handling
> command line utilities can never be a justifiable reason for
> incorrect functionality.

As far as I know there is no requirement in POSIX that applications
must support multibyte locales, and there's no documentation claiming
that the utilities in question support multibyte locales, so this is
not a bug; it's a feature request.

My opinion about this may be colored by an experience I had yesterday
with the latest version of GNU sed. Single-byte it worked fine;
multibyte it was so slow that I gave up. We don't want this to
happen with the core utilities.
bug-coreutils@HIDDEN: bug#10880; Package coreutils.
Received: (at 10880) by debbugs.gnu.org; 25 Feb 2012 22:10:15 +0000
From: "Marton Kadar" <marton.kadar@HIDDEN>
To: 10880 <at> debbugs.gnu.org
Date: Sat, 25 Feb 2012 17:07:27 -0500
Subject: Re: bug#10880: instead of characters, tr works on bytes
Message-ID: <20120225220727.107140@HIDDEN>

> ----- Original Message -----
> From: Eric Blake
> Sent: 02/25/12 04:28 AM
> To: Marton Kadar
> Subject: Re: bug#10880: instead of characters, tr works on bytes
>
> On 02/24/2012 07:29 AM, Marton Kadar wrote:
> > Don't know which is the official way to report a bug in 'tr'
> > so I will copy to this list too. CC me on replies as I am not
> > subscribing.
>
> Sending mail to coreutils@HIDDEN _is_ what creates a bug on
> debbugs.gnu.org, so you have managed to create a duplicate. Paul Eggert
> has already merged 9365, 10880, and 9569, so now, replying to any one of
> those three is merely adding information to the same report.
>
> >>
> >> Let us try to delete a character and see if it worked:
> >>
> >> $ echo árvíz | tr -d á | od -c
> >> 0000000   r   v 255   z  \n
> >> 0000005
>
> Please keep in mind that upstream coreutils is not yet converted over to
> multibyte support. This is evidence of one of the places that multibyte
> support is required, and therefore, where you cannot expect things to
> work yet. No one has yet contributed a maintainable patch that does not
> penalize single-byte locales, at least not upstream. Several distros
> have their own UTF-8 patches that they apply, but then, this would be a
> bug you report to your distro and not upstream.
>
> >> I'll check the source for tr myself although never coded in C.
> >> This should be a trivial fix.
>
> Alas, dealing with multibyte characters without penalizing single-byte
> locales is NOT trivial, or it would have been done long ago.

"Penalizing" single-byte locales - did you mean in performance or in
functionality? I understand that a generalized algorithm would probably
be slower than one tuned for the single-byte case. But I suspect that
you are also referring to some functional implication, as avoiding a
solely performance related penalty in text handling command line
utilities can never be a justifiable reason for incorrect functionality.

Besides, the execution path (single byte specific or generalized
multibyte capable) can be determined at program startup, so in the
worst case there can be a tr and a tr-slow-but-multibyte version, the
former calling the latter when so directed by the locale settings.

A minimal "solution" could also be to put a warning on each affected
program's man page: "Multibyte locales currently unsupported!". It is
not always immediately apparent what the problem is, as in many special
cases it happens to work as expected, then in very similar other cases
it doesn't.

>
> --
> Eric Blake   eblake@HIDDEN    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
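The startup dispatch proposed in this message might look roughly like the following. This is a minimal Python sketch under stated assumptions, not a patch for tr: `make_deleter`, `delete_bytes`, `delete_chars` and the `MULTIBYTE_CODECS` set are hypothetical names, and the set lists only UTF-8 for brevity. A C implementation would more likely consult `MB_CUR_MAX` after calling `setlocale`.

```python
import codecs

# Assumption: a real tool would ask the C library about the active locale;
# here only UTF-8 is treated as a multibyte charset.
MULTIBYTE_CODECS = {"utf-8"}


def delete_bytes(data: bytes, unwanted: bytes) -> bytes:
    """Fast path for single-byte encodings: one byte is one character."""
    return data.translate(None, unwanted)


def delete_chars(data: bytes, unwanted: bytes, encoding: str) -> bytes:
    """Slow path: decode, drop whole characters, encode again."""
    drop = set(unwanted.decode(encoding))
    kept = (c for c in data.decode(encoding) if c not in drop)
    return "".join(kept).encode(encoding)


def make_deleter(encoding: str):
    """Choose the execution path once, at startup, from the locale's charset."""
    if codecs.lookup(encoding).name in MULTIBYTE_CODECS:
        return lambda data, unwanted: delete_chars(data, unwanted, encoding)
    return delete_bytes


for charset in ("UTF-8", "ISO-8859-2"):
    delete = make_deleter(charset)
    result = delete("árvíz\n".encode(charset), "á".encode(charset))
    print(charset, result.hex(" "))
```

In this sketch the single-byte path is the unmodified byte-wise routine, so a single-byte locale pays only for the one-time check in `make_deleter`. It does not address the maintenance concern of keeping two code paths in step.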
bug-coreutils@HIDDEN: bug#10880; Package coreutils.
Received: (at 10880) by debbugs.gnu.org; 25 Feb 2012 03:31:25 +0000
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat
To: Marton Kadar <marton.kadar@HIDDEN>
Cc: 10880 <at> debbugs.gnu.org
Date: Fri, 24 Feb 2012 20:28:41 -0700
Subject: Re: bug#10880: instead of characters, tr works on bytes
Message-ID: <4F485569.2040002@HIDDEN>
In-Reply-To: <20120224142912.107150@HIDDEN>

On 02/24/2012 07:29 AM, Marton Kadar wrote:
> Don't know which is the official way to report a bug in 'tr'
> so I will copy to this list too. CC me on replies as I am not
> subscribing.

Sending mail to coreutils@HIDDEN _is_ what creates a bug on
debbugs.gnu.org, so you have managed to create a duplicate. Paul Eggert
has already merged 9365, 10880, and 9569, so now, replying to any one of
those three is merely adding information to the same report.

>>
>> Let us try to delete a character and see if it worked:
>>
>> $ echo árvíz | tr -d á | od -c
>> 0000000   r   v 255   z  \n
>> 0000005

Please keep in mind that upstream coreutils is not yet converted over to
multibyte support. This is evidence of one of the places that multibyte
support is required, and therefore, where you cannot expect things to
work yet. No one has yet contributed a maintainable patch that does not
penalize single-byte locales, at least not upstream. Several distros
have their own UTF-8 patches that they apply, but then, this would be a
bug you report to your distro and not upstream.

>> I'll check the source for tr myself although never coded in C.
>> This should be a trivial fix.

Alas, dealing with multibyte characters without penalizing single-byte
locales is NOT trivial, or it would have been done long ago.

--
Eric Blake   eblake@HIDDEN    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
bug-coreutils@HIDDEN: bug#10880; Package coreutils.
Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org.
Received: (at submit) by debbugs.gnu.org; 24 Feb 2012 17:30:17 +0000
From: "Marton Kadar" <marton.kadar@HIDDEN>
To: bug-coreutils@HIDDEN
Date: Fri, 24 Feb 2012 09:29:12 -0500
Subject: instead of characters, tr works on bytes
Message-ID: <20120224142912.107150@HIDDEN>

Don't know which is the official way to report a bug in 'tr'
so I will copy to this list too. CC me on replies as I am not
subscribing.

> ----- Original Message -----
> From: Marton Kadar
> Sent: 02/24/12 03:18 PM
> To: 9365 <at> debbugs.gnu.org
> Subject: Example
>
> Environment for Hungary where á and í are proper lowercase letters
> but for example Spanish has these letters too:
>
> $ set | grep ^L
> LANG=hu_HU.UTF-8
> LC_ALL=hu_HU.UTF-8
> LINES=73
> LOGNAME=kadar1marto518
>
> Now let's see the bytestream for the following string
> (which means flood in Hungarian):
>
> $ echo árvíz | od -c
> 0000000 303 241   r   v 303 255   z  \n
> 0000010
>
> Let us try to delete a character and see if it worked:
>
> $ echo árvíz | tr -d á | od -c
> 0000000   r   v 255   z  \n
> 0000005
>
> Correct expected behavior would rather be:
>
> $ echo árvíz | tr -d á | od -c
> 0000000   r   v 303 255   z  \n
> 0000006
>
> I'll check the source for tr myself although never coded in C.
> This should be a trivial fix. The problem is especially annoying
> as we currently have no real simple and good general purpose case
> conversion tool. (correct me if I'm wrong, but tr should be this
> tool).
>
> Marton Kadar
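The difference between the observed and the expected output in this report can be modelled in a few lines. The following is a minimal Python sketch, not the tr source: `od_c`, `delete_bytewise` and `delete_charwise` are hypothetical helpers, and `od_c` only approximates the `od -c` layout (no offsets, no column padding).

```python
def od_c(data: bytes) -> str:
    """Render bytes roughly the way `od -c` does: printable ASCII as-is,
    newline as a backslash escape, everything else as three octal digits."""
    cells = []
    for b in data:
        if b == 0x0A:
            cells.append("\\n")
        elif 0x20 <= b < 0x7F:
            cells.append(chr(b))
        else:
            cells.append(f"{b:03o}")
    return " ".join(cells)


def delete_bytewise(data: bytes, char: str) -> bytes:
    """What the report observed: each byte of the encoded character is deleted."""
    return data.translate(None, char.encode("utf-8"))


def delete_charwise(data: bytes, char: str) -> bytes:
    """What the report expected: decode first, then delete whole characters."""
    return data.decode("utf-8").replace(char, "").encode("utf-8")


flood = "árvíz\n".encode("utf-8")
print(od_c(flood))                        # 303 241 r v 303 255 z \n
print(od_c(delete_bytewise(flood, "á")))  # r v 255 z \n
print(od_c(delete_charwise(flood, "á")))  # r v 303 255 z \n
```

The byte-wise variant removes the 303 lead byte of ‘í’ as a side effect, because ‘á’ (303 241) and ‘í’ (303 255) share it. The character-wise variant leaves the six bytes the report expected.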
"Marton Kadar" <marton.kadar@HIDDEN>
:bug-coreutils@HIDDEN
.
bug-coreutils@HIDDEN: bug#10880; Package coreutils.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.