GNU bug report logs - #21395
multibyte: cut and Spanish characters

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Severity: wishlist; Reported by: Michael Lee <michaellee213@HIDDEN>; dated Wed, 2 Sep 2015 00:54:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: cut and Spanish characters' from 'Bug with cut and Spanish characters from text file with UTF-8 encoding'. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.

Message received at 21395 <at> debbugs.gnu.org:


Date: Wed, 02 Sep 2015 12:03:10 +0100
From: Pádraig Brady <P@HIDDEN>
To: Michael Lee <michaellee213@HIDDEN>, 21395 <at> debbugs.gnu.org
Subject: Re: bug#21395: Bug with cut and Spanish characters from text file
 with UTF-8 encoding

On 02/09/15 01:41, Michael Lee wrote:
> When using cut as, "cut -c 1" with a text file with Spanish characters, it does not display those characters.
> For example, the character ã or á will not display if it is the first character and the file is trimmed using the cut command.

Debian and Ubuntu do not apply the i18n patch that Fedora, RHEL and SUSE
(for example) use, and so do not support multi-byte characters in cut.
That i18n patch is itself problematic and incomplete, and there are plans
to bring the functionality upstream at some stage:

http://www.pixelbeat.org/docs/coreutils_i18n/

cheers,
Pádraig
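
Editorial illustration of the reply above, not part of the original message. This is a minimal Python sketch, assuming that a cut without multi-byte support treats "-c 1" as "keep the first byte". The helper names cut_first_byte and cut_first_char are hypothetical and exist only for this example. In UTF-8, á is encoded as two bytes, so keeping one byte leaves an incomplete sequence that a terminal cannot display as a letter.

```python
# Sketch only: contrasts byte-wise and character-wise selection of the
# first "character" of a UTF-8 line. Helper names are hypothetical.

def cut_first_byte(line: bytes) -> bytes:
    """Mimic a byte-oriented `cut -c 1`: keep only the first byte."""
    return line[:1]


def cut_first_char(line: bytes, encoding: str = "utf-8") -> bytes:
    """Keep the first decoded character and re-encode it."""
    return line.decode(encoding)[:1].encode(encoding)


# "\u00e1" is the precomposed letter a-acute; in UTF-8 it is 0xC3 0xA1.
line = "\u00e1rbol".encode("utf-8")

print(cut_first_byte(line))                  # b'\xc3' (half of the letter)
print(cut_first_char(line))                  # b'\xc3\xa1'
print(cut_first_char(line).decode("utf-8"))  # the letter itself
```

The lone 0xC3 byte is not valid UTF-8 on its own, which is consistent with the blank output described in the report.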




Information forwarded to bug-coreutils@HIDDEN:
bug#21395; Package coreutils.

Message received at submit <at> debbugs.gnu.org:


Date: Wed, 2 Sep 2015 00:41:09 +0000 (UTC)
From: Michael Lee <michaellee213@HIDDEN>
To: "bug-coreutils@HIDDEN" <bug-coreutils@HIDDEN>
Message-ID: <1569154567.83126.1441154469654.JavaMail.yahoo@HIDDEN>
Subject: Bug with cut and Spanish characters from text file with UTF-8 encoding

To whom it may concern:

To preface the explanation of this possible bug, the following was tested:

Encoding(s) was/were determined by opening the Spanish text files with vi
and using ":set" to view the encoding type(s).

Text files containing Spanish letters/characters were used in this test.
First, the locale in the bash shell was set to UTF-8 (default setting with
Ubuntu) and the encoding on the first test file was encoded with Latin1.
Under these conditions head and tail were used to try to output several
Spanish letters/characters with accents above the letter. Trying to use
"head spanish.txt" and "tail spanish.txt" resulted in output with spaces in
place of the Spanish letters/characters.

After spanish.txt was converted from Latin1 to UTF-8 with iconv, the test
was repeated with the head and tail utilities and then the output was
correct. The Spanish letters/characters then displayed correctly instead of
what previously appeared to be blank spaces. When the "cut" command was
added to this, the behavior of spaces taking the place of letters returned.

For example, "head -n 50 spanish.txt | cut -c 1" or
"tail -n 50 spanish.txt | cut -c 1" will result in the first character
showing only blank spaces where there are Spanish letters/characters.
Letters with accents are displayed as blank spaces. Using only head or tail
will show the Spanish letters correctly, but not with the cut command.

When using cut as, "cut -c 1" with a text file with Spanish characters, it
does not display those characters.

For example, the character ã or á will not display if it is the first
character and the file is trimmed using the cut command.

Converting the file from Latin1 to UTF-8 solved the problem with head and
tail, but not cut.

The cut command does not seem to output the special letters/characters
correctly.

Is there an environment variable that could fix this or could it possibly
be a bug?

Thank you for your time.

Sincerely,
Michael Lee
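
Editorial illustration of the first half of the report, not part of the original message. This is a rough Python sketch, assuming the terminal decodes output as UTF-8, of why a Latin1 file can appear to lose its accented letters under head and tail, and why converting it fixes that. The variable names are made up for this example, and the iconv invocation in the comment is one plausible form; the report does not give the exact command used.

```python
# Sketch only: the same letter (a-acute, U+00E1) in Latin1 and in UTF-8.

# In Latin1 the letter is the single byte 0xE1.
latin1_bytes = "\u00e1".encode("latin-1")

# Read as UTF-8, 0xE1 announces a multi-byte sequence that never arrives,
# so a UTF-8 decoder cannot produce the letter. Python substitutes U+FFFD;
# a terminal may show a blank or a placeholder glyph instead.
shown = latin1_bytes.decode("utf-8", errors="replace")

# Roughly what `iconv -f ISO-8859-1 -t UTF-8` does: reinterpret the bytes
# as Latin1 and write them back out as UTF-8 (two bytes for this letter).
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(latin1_bytes)   # b'\xe1'
print(repr(shown))    # '\ufffd'
print(utf8_bytes)     # b'\xc3\xa1'
```

After conversion the letter occupies two bytes, which is why whole-line tools such as head and tail display it correctly while a byte-oriented "cut -c 1" still splits it.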




Acknowledgement sent to Michael Lee <michaellee213@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN.
Report forwarded to bug-coreutils@HIDDEN:
bug#21395; Package coreutils.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.