I'm not sure if this is a bug or if I'm using it wrong. As a matter of fact, I
tested this on several systems, and on BSD-based systems (Mac) the tr tool
gives different results -- the ones I expected.

The simplest way to reproduce this looks like this (sorry, umlaut ahead):

$ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
% 00000000: c3 .

The echo prints a capital A with a circumflex (Â), and I expect the tr
command to delete everything except the small umlaut ä. It looks as if
tr just deletes the second byte.

When I try without the umlaut it gives me the empty result, as expected:

$ echo -ne "\xc3\x82" | tr -cd "a" | xxd
[empty result]

I tested several systems, the oldest is a Debian with coreutils 8.5, the
newest an Ubuntu with coreutils 8.25.

For the moment, I'll try to solve my problem differently, but... is this
a bug? Thanks in advance!

Regards,
Ronald.
-- 
There is no reason for any individual to have a computer in his home.
(Ken Olsen, DEC)
tags 26362 notabug wishlist stop 26362

Hello,

> On Apr 4, 2017, at 10:01, Ronald Schaten <ronald@HIDDEN> wrote:
>
> I'm not sure if this is a bug or if I'm using it wrong.

Neither - it is simply that GNU tr does not yet support multibyte characters.

> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
>
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3 .
>
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.

What happened here is this:
'tr' currently reads the input string parameter (SET1) as single-byte, and
so treats it as if you've given two octets: \xC3 \xA4 (which is the UTF-8
encoding of small A with umlaut).
Then, it reads the input octet-by-octet, keeps \xC3 and deletes \x82.

> When I try without the umlaut it gives me the empty result, as expected:
>
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd

Indeed, because here you're asking to keep only octets whose value is
\x61 (the ASCII value of 'a') - neither "\xC3" nor "\x82" matches, and so
they are deleted.

> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!

Not a bug - but a yet-missing feature.
For relevant discussion see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8

As a temporary work-around, you can use GNU sed, which is multibyte-aware:

$ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
ä

'sed' also supports one more thing, called "character equivalence classes".
In the following example, all characters except those that are equivalent
to 'a' will be deleted:

$ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
aäÂ

Character equivalence classes will work with a future 'tr' as well, once
multibyte support is added.

Lastly, "echo -en" is not portable. It is recommended to use "printf" instead.
"printf" has the added advantage that it supports Unicode code points
directly, instead of having to know the UTF-8 encoding of a Unicode
character, e.g.:

printf "\u00c2\n"

will print capital A with circumflex (and will work in other locales if they
support this character, not just UTF-8).

I'm thus marking this item as "wishlist" and "notabug", but I'll keep it
open until it is implemented.

Discussion can continue by replying to this thread.

regards,
- assaf
Subject: Re: bug#26362: tr -cd -- Problem with UTF-8?
From: Assaf Gordon <assafgordon@HIDDEN>
Date: Tue, 4 Apr 2017 22:19:15 -0400
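For anyone following along, the byte-wise behaviour described in this thread can be checked without "echo -ne" or xxd. This is only a sketch under a few assumptions that are not in the original messages: octal escapes stand in for the \x escapes, od stands in for xxd, and LC_ALL=C is set so that tr works on single bytes whichever tr implementation is installed. The helper to_hex and the variable names are made up for illustration.

```shell
# Bytes involved (octal escapes are understood by POSIX printf and tr):
#   \303\202 = 0xC3 0x82, UTF-8 for capital A with circumflex (the input)
#   \303\244 = 0xC3 0xA4, UTF-8 for small a with umlaut (SET1)

# Hypothetical helper: print stdin as contiguous lowercase hex.
to_hex() { od -An -tx1 | tr -d ' \n'; }

# Case 1: SET1 is the two bytes of the umlaut. tr -cd deletes every byte
# that is not 0xC3 or 0xA4, so 0xC3 survives and 0x82 is removed.
kept_umlaut=$(printf '\303\202' | LC_ALL=C tr -cd '\303\244' | to_hex)

# Case 2: SET1 is the single byte 0x61 ('a'). Neither input byte matches,
# so both are deleted and nothing is left.
kept_a=$(printf '\303\202' | LC_ALL=C tr -cd 'a' | to_hex)

echo "SET1 = umlaut: [$kept_umlaut]"
echo "SET1 = a:      [$kept_a]"
```

The first line should show c3 between the brackets and the second should show nothing, which matches the two results in the original report.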
