GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters

Package: coreutils; Severity: wishlist; Reported by: Jan Novak <jn@HIDDEN>; dated Sat, 5 Apr 2014 23:22:01 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: printf: %s counts bytes instead of characters' from 'UTF-8 printf string formating problem' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 9 May 2014 02:16:55 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu May 08 22:16:55 2014
Received: from localhost ([127.0.0.1]:56448 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WiaMo-0003zq-NU
	for submit <at> debbugs.gnu.org; Thu, 08 May 2014 22:16:55 -0400
Received: from nm10-vm0.bullet.mail.bf1.yahoo.com ([98.139.213.147]:21407)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <lsatenstein@HIDDEN>) id 1WiaMk-0003zX-Sj
 for 17196 <at> debbugs.gnu.org; Thu, 08 May 2014 22:16:52 -0400
Received: from [98.139.212.153] by nm10.bullet.mail.bf1.yahoo.com with NNFMP;
 09 May 2014 02:16:45 -0000
Received: from [98.139.212.238] by tm10.bullet.mail.bf1.yahoo.com with NNFMP;
 09 May 2014 02:16:45 -0000
Received: from [127.0.0.1] by omp1047.mail.bf1.yahoo.com with NNFMP;
 09 May 2014 02:16:45 -0000
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 366228.35034.bm@HIDDEN
Received: (qmail 10977 invoked by uid 60001); 9 May 2014 02:16:45 -0000
Received: from [70.49.120.43] by web142606.mail.bf1.yahoo.com via HTTP;
 Thu, 08 May 2014 19:16:45 PDT
X-Mailer: YahooMailWebService/0.8.188.663
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
Message-ID: <1399601805.73330.YahooMailNeo@HIDDEN>
Date: Thu, 8 May 2014 19:16:45 -0700 (PDT)
From: Leslie S Satenstein <lsatenstein@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
To: Jan Novak <jn@HIDDEN>
In-Reply-To: <53412952.1040506@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="562241088-351124307-1399601805=:73330"
X-Spam-Score: -0.6 (/)
X-Debbugs-Envelope-To: 17196
Cc: "17196 <at> debbugs.gnu.org" <17196 <at> debbugs.gnu.org>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: Leslie S Satenstein <lsatenstein@HIDDEN>
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.6 (/)

--562241088-351124307-1399601805=:73330
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Perhaps printf() needs some wide character extensions via %new characters

Regards

 Leslie

Mr. Leslie Satenstein
SENT FROM MY OPEN SOURCE LINUX SYSTEM.

>________________________________
> From: Pádraig Brady <P@HIDDEN>
>To: Jan Novak <jn@HIDDEN>
>Cc: 17196 <at> debbugs.gnu.org
>Sent: Sunday, April 6, 2014 6:15 AM
>Subject: bug#17196: UTF-8 printf string formating  problem
>
>
>On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>>
>> $ printf "|%3s|\n" "a"
>> |  a|
>>
>> $ printf "|%3s|\n" "á"     (char is a-acute)
>> | á|
>>
>> expected output:
>> |  á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
>
>Yes printf follows the C standard which only considers bytes.
>awk does respect characters in width specifiers though:
>
>  $ awk 'BEGIN{printf "|%3s|\n", "á"}'
>  |  á|
>
>I don't think we'd be able to change the current operation of printf
>due to backwards compat reasons? Though we might be able to somehow leverage
>the existing multibyte character aware alignment/truncation code in:
>http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>
>thanks,
>Pádraig.
--562241088-351124307-1399601805=:73330--
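
Editorial note: the mismatch reported in the quoted message can be reproduced outside of printf. What follows is a minimal Python sketch, not the coreutils or bash implementation. The helper names `pad_bytes` and `pad_chars` are hypothetical; they right-align a string in a field while counting UTF-8 bytes and code points respectively.

```python
def pad_bytes(s: str, width: int) -> str:
    """Right-align s in `width`, counting UTF-8 bytes.

    This mirrors the byte-counting behaviour the thread reports for
    printf's %Ns; it is an illustration, not printf's actual code.
    """
    n = len(s.encode("utf-8"))
    return " " * max(width - n, 0) + s


def pad_chars(s: str, width: int) -> str:
    """Right-align s in `width`, counting code points instead of bytes."""
    return " " * max(width - len(s), 0) + s


if __name__ == "__main__":
    # U+00E1 (a-acute) is one code point but two bytes in UTF-8.
    for text in ("a", "\u00e1"):
        print("|" + pad_bytes(text, 3) + "|", "|" + pad_chars(text, 3) + "|")
```

Counting code points fixes the a-acute case, but it is still only an approximation of terminal columns: combining marks and double-width characters need separate treatment.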




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 11 Apr 2014 13:41:08 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Apr 11 09:41:08 2014
Received: from localhost ([127.0.0.1]:45289 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WYbha-0005B4-Nf
	for submit <at> debbugs.gnu.org; Fri, 11 Apr 2014 09:41:07 -0400
Received: from forward7l.mail.yandex.net ([84.201.143.140]:42991)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <sdaoden@HIDDEN>) id 1WYbhO-00059w-Gk
 for 17196 <at> debbugs.gnu.org; Fri, 11 Apr 2014 09:40:56 -0400
Received: from smtp4h.mail.yandex.net (smtp4h.mail.yandex.net [84.201.186.21])
 by forward7l.mail.yandex.net (Yandex) with ESMTP id 08AE8BC121D;
 Fri, 11 Apr 2014 17:40:46 +0400 (MSK)
Received: from smtp4h.mail.yandex.net (localhost [127.0.0.1])
 by smtp4h.mail.yandex.net (Yandex) with ESMTP id B3F682C372C;
 Fri, 11 Apr 2014 17:40:45 +0400 (MSK)
Received: from unknown (unknown [89.204.130.136])
 by smtp4h.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id 9aGtRyfnHc-ehhehFk0; 
 Fri, 11 Apr 2014 17:40:44 +0400
 (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits))
 (Client certificate not present)
X-Yandex-Uniq: fefe3f1e-da47-4b6e-9f5e-ec97a4eeadf9
Authentication-Results: smtp4h.mail.yandex.net; dkim=pass header.i=@yandex.com
Date: Fri, 11 Apr 2014 15:40:41 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: chet.ramey@HIDDEN
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140411144041.KmpeitNBK3J2xP1tlaqPyJ+P@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
 <20140410075610.GO26358@HIDDEN>
 <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
 <5346DE92.9020004@HIDDEN>
 <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>
 <5347DF27.50702@HIDDEN>
In-Reply-To: <5347DF27.50702@HIDDEN>
User-Agent: s-nail v14.6.4-1-ga39836e
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, =?ISO-8859-1?Q?P=E1draig?= Brady <P@HIDDEN>,
 Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>,
 Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>,
 Eric Blake <eblake@HIDDEN>

Chet Ramey <chet.ramey@HIDDEN> wrote:
 |On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
 |> Hello,
 |>
 |> Chet Ramey <chet.ramey@HIDDEN> wrote:
 |>|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
 |>|
 |>|> Even better would nonetheless be the great picture with
 |>|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
 |>|> a tuple of {[EastAsianWidth indication,] is-combining,
 |>|> width-if-non-combining} and best even some composition function.
 |>|
 |>|But we have always been at war with EastAsia!
 |>
 |> I see you really would love to get a hand from POSIX too:
 |
 |I'm sorry, I realize that was rather obscure.  It's from "1984", by George
 |Orwell.  It's a central theme to the book.  The quote was an attempt to

oh, ah, yes.  So.. i got it right without getting it right.

Interestingly, a retrospective work on Walter Benjamin started
yesterday (<http://www.eingedenken.de/enter.html> --
"remembrance"): an artist (Christoph Korn) walked Benjamin's last
trip from Banyuls-sur-Mer (France) to Portbou (Spain, where
Benjamin committed suicide because he could not reach the U.S.),
following a fixed time frame (a monotonic tick, so to say), after
which he recited theses of Benjamin (e.g., "There is no
document of civilization which is not at the same time a document
of barbarism."), followed by pausing and taking a (steady cam)
video of the most recent leg.  The association with Paul Klee's
"Angelus Novus" is desired (by both parties).

 |inject levity into the discussion.

That was easy.

--steffen




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 11 Apr 2014 12:26:04 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Apr 11 08:26:04 2014
Received: from localhost ([127.0.0.1]:45241 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WYaWw-0001vM-Dy
	for submit <at> debbugs.gnu.org; Fri, 11 Apr 2014 08:26:04 -0400
Received: from mpv1.tis.cwru.edu ([129.22.105.36]:19616)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <chet.ramey@HIDDEN>) id 1WYaWp-0001uc-2I
 for 17196 <at> debbugs.gnu.org; Fri, 11 Apr 2014 08:25:59 -0400
Received: from mpv5.tis.CWRU.Edu (EHLO mpv5.cwru.edu) ([129.22.105.51])
 by mpv1.tis.cwru.edu (MOS 4.3.5-GA FastPath queued)
 with ESMTP id BFC56788; Fri, 11 Apr 2014 08:25:38 -0400 (EDT)
Received: from caleb.INS.CWRU.Edu (EHLO caleb.ins.cwru.edu) ([129.22.8.211])
 by mpv5.cwru.edu (MOS 4.3.5-GA FastPath queued)
 with ESMTP id ATQ66868 (AUTH cpr);
 Fri, 11 Apr 2014 08:25:22 -0400 (EDT)
Message-ID: <5347DF27.50702@HIDDEN>
Date: Fri, 11 Apr 2014 08:25:11 -0400
From: Chet Ramey <chet.ramey@HIDDEN>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN>
 <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
 <20140410075610.GO26358@HIDDEN>
 <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
 <5346DE92.9020004@HIDDEN>
 <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>
In-Reply-To: <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Junkmail-Status: score=10/50, host=mpv5.cwru.edu
X-Junkmail-Whitelist: YES (by domain whitelist at mpv1.tis.cwru.edu)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, chet.ramey@HIDDEN,
 =?ISO-8859-1?Q?P=E1draig_Brady?= <P@HIDDEN>,
 Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>,
 Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>,
 Eric Blake <eblake@HIDDEN>
Reply-To: chet.ramey@HIDDEN

On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
> Hello,
> 
> Chet Ramey <chet.ramey@HIDDEN> wrote:
>  |On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
>  |
>  |> Even better would nonetheless be the great picture with
>  |> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
>  |> a tuple of {[EastAsianWidth indication,] is-combining,
>  |> width-if-non-combining} and best even some composition function.
>  |
>  |But we have always been at war with EastAsia!
> 
> I see you really would love to get a hand from POSIX too:

I'm sorry, I realize that was rather obscure.  It's from "1984", by George
Orwell.  It's a central theme to the book.  The quote was an attempt to
inject levity into the discussion.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@HIDDEN    http://cnswww.cns.cwru.edu/~chet/




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 11 Apr 2014 10:16:34 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Apr 11 06:16:34 2014
Received: from localhost ([127.0.0.1]:45202 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WYYVd-0003iz-FJ
	for submit <at> debbugs.gnu.org; Fri, 11 Apr 2014 06:16:33 -0400
Received: from forward7l.mail.yandex.net ([84.201.143.140]:36493)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <sdaoden@HIDDEN>) id 1WYYVZ-0003iN-Cq
 for 17196 <at> debbugs.gnu.org; Fri, 11 Apr 2014 06:16:31 -0400
Received: from smtp3h.mail.yandex.net (smtp3h.mail.yandex.net [84.201.186.20])
 by forward7l.mail.yandex.net (Yandex) with ESMTP id 9B562BC0CD3;
 Fri, 11 Apr 2014 14:16:21 +0400 (MSK)
Received: from smtp3h.mail.yandex.net (localhost [127.0.0.1])
 by smtp3h.mail.yandex.net (Yandex) with ESMTP id 6A4581B42685;
 Fri, 11 Apr 2014 14:16:20 +0400 (MSK)
Received: from unknown (unknown [89.204.130.136])
 by smtp3h.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id aicjDkOG4s-GI5WWnkl; 
 Fri, 11 Apr 2014 14:16:19 +0400
 (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits))
 (Client certificate not present)
X-Yandex-Uniq: a0e620ee-628c-4bf3-a359-9abdf75c88a8
Authentication-Results: smtp3h.mail.yandex.net; dkim=pass header.i=@yandex.com
Date: Fri, 11 Apr 2014 12:16:15 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: chet.ramey@HIDDEN
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
 <20140410075610.GO26358@HIDDEN>
 <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
 <5346DE92.9020004@HIDDEN>
In-Reply-To: <5346DE92.9020004@HIDDEN>
User-Agent: s-nail v14.6.4-1-ga39836e
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, =?ISO-8859-1?Q?P=E1draig?= Brady <P@HIDDEN>,
 Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>,
 Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>,
 Eric Blake <eblake@HIDDEN>

Hello,

Chet Ramey <chet.ramey@HIDDEN> wrote:
 |On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
 |
 |> Even better would nonetheless be the great picture with
 |> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
 |> a tuple of {[EastAsianWidth indication,] is-combining,
 |> width-if-non-combining} and best even some composition function.
 |
 |But we have always been at war with EastAsia!

I see you really would love to get a hand from POSIX too:

  ?0[steffen@sherwood bash-4.3]$ grep -r UNICODE_COMB .
  ./lib/readline/display.c:      if (t > 0 && UNICODE_COMBINING_CHAR (wc) && WCWIDTH (wc) == 0)
  ./lib/readline/rlmbutil.h:#define UNICODE_COMBINING_CHAR(x) ((x) >= 768 && (x) <= 879)
  ./lib/readline/rlmbutil.h:#  define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : wcwidth(wc))
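
Editorial note: the range test in the grep output above covers code points 768 to 879, that is U+0300 to U+036F. The Python sketch below mirrors only that range check and contrasts it with a lookup in the Unicode database. The function names are hypothetical, and this is an illustration rather than readline's code.

```python
import unicodedata

COMBINING_FIRST = 768  # U+0300, lower bound from the macro in the grep output
COMBINING_LAST = 879   # U+036F, upper bound from the macro in the grep output


def in_readline_combining_range(ch: str) -> bool:
    """Return True if ch falls in the fixed 768..879 range used by the macro."""
    return COMBINING_FIRST <= ord(ch) <= COMBINING_LAST


def is_combining(ch: str) -> bool:
    """Broader check: non-zero canonical combining class in the Unicode database."""
    return unicodedata.combining(ch) != 0


if __name__ == "__main__":
    # U+0301 is inside the fixed range; U+20D7 is a combining mark outside it.
    for ch in ("a", "\u0301", "\u20d7"):
        print(f"U+{ord(ch):04X}", in_readline_combining_range(ch), is_combining(ch))
```

A fixed range is cheap to evaluate but misses combining marks that live in other blocks, which may be part of why a richer library interface is being asked for in this thread.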

And sorry for not making this clear for those who never dealt with
the problem (which is probably not uncommon for filesystem or
other kernel hackers): `EastAsianWidth' refers to a property of
Unicode and ISO 10646:

  # EastAsianWidth-6.3.0.txt
  # Date: 2013-02-05, 20:09:00 GMT [KW, LI]
  #
  # East Asian Width Properties
  #
  # This file is an informative contributory data file in the
  # Unicode Character Database.
  #
  # Copyright (c) 1991-2013 Unicode, Inc.
  # For terms of use, see http://www.unicode.org/terms_of_use.html
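
Editorial note: the tuple proposed earlier in the thread ({EastAsianWidth indication, is-combining, width-if-non-combining}) can be sketched with Python's `unicodedata` module. This is only an illustration under stated assumptions. `CharInfo`, `char_info` and `display_width` are hypothetical names, and the rule that "W" and "F" characters take two cells while everything else takes one is a simplification that ignores control characters and ambiguous-width characters.

```python
import unicodedata
from typing import NamedTuple


class CharInfo(NamedTuple):
    east_asian_width: str        # East Asian Width class, e.g. "W", "F", "Na", "A"
    is_combining: bool           # True for a non-zero canonical combining class
    width_if_non_combining: int  # assumed cell count when the character is not combining


def char_info(ch: str) -> CharInfo:
    """Return the hypothetical per-character tuple for a single code point."""
    eaw = unicodedata.east_asian_width(ch)
    combining = unicodedata.combining(ch) != 0
    # Assumption: wide ("W") and fullwidth ("F") characters occupy two cells.
    width = 2 if eaw in ("W", "F") else 1
    return CharInfo(eaw, combining, width)


def display_width(s: str) -> int:
    """Sum assumed cell widths, skipping combining characters."""
    total = 0
    for ch in s:
        info = char_info(ch)
        if not info.is_combining:
            total += info.width_if_non_combining
    return total


if __name__ == "__main__":
    for text in ("a\u0301", "\u4e2d\u6587"):
        print(repr(text), len(text.encode("utf-8")), len(text), display_width(text))
```

The demo prints byte count, code point count and assumed cell count side by side, which are the three different notions of "width" that this bug report keeps running into.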

--steffen

...
To be honest i must admit i first was pissed, so let me append the
original first part of this message, please:

  and so the landslide had brought it down.
  But i would quote Paul Vixie, who stated in a message today

    gentlemen and ladies, we have met the enemy, and they are our
    egos.

    vixie

  From my point of view it's the matter of culture and philosophy
  (including religion) how to deal with that very problem.
  And i can assure you that Jehovah's Witnesses, who have visited me
  regularly for some years now, like to drink a bit of my Buddhist
  tea.

Paul Vixie is correct.
I am stupid.
With greetings from someone who will undergo his 42nd birthday soon




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 10 Apr 2014 18:11:07 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 10 14:11:07 2014
Received: from localhost ([127.0.0.1]:44791 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WYJRJ-0002to-Fc
	for submit <at> debbugs.gnu.org; Thu, 10 Apr 2014 14:11:06 -0400
Received: from mpv2.tis.cwru.edu ([129.22.105.37]:11566)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <chet.ramey@HIDDEN>) id 1WYJRB-0002t1-3t
 for 17196 <at> debbugs.gnu.org; Thu, 10 Apr 2014 14:11:02 -0400
Received: from mpv6.tis.CWRU.Edu (EHLO mpv6.cwru.edu) ([129.22.104.221])
 by mpv2.tis.cwru.edu (MOS 4.3.5-GA FastPath queued)
 with ESMTP id BDG14791; Thu, 10 Apr 2014 14:10:39 -0400 (EDT)
Received: from caleb.INS.CWRU.Edu (EHLO caleb.ins.cwru.edu) ([129.22.8.211])
 by mpv6.cwru.edu (MOS 4.3.5-GA FastPath queued)
 with ESMTP id AJH10974 (AUTH cpr);
 Thu, 10 Apr 2014 14:10:29 -0400 (EDT)
Message-ID: <5346DE92.9020004@HIDDEN>
Date: Thu, 10 Apr 2014 14:10:26 -0400
From: Chet Ramey <chet.ramey@HIDDEN>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: Steffen Nurpmeso <sdaoden@HIDDEN>, Rich Felker <dalias@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN>
 <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
 <20140410075610.GO26358@HIDDEN>
 <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
In-Reply-To: <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Junkmail-Status: score=10/50, host=mpv6.cwru.edu
X-Junkmail-Whitelist: YES (by domain whitelist at mpv2.tis.cwru.edu)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, chet.ramey@HIDDEN,
 =?ISO-8859-1?Q?P=E1draig_Brady?= <P@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>,
 Austin Group <austin-group-l@HIDDEN>, Eric Blake <eblake@HIDDEN>
Reply-To: chet.ramey@HIDDEN

On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:

> Even better would nonetheless be the great picture with
> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
> a tuple of {[EastAsianWidth indication,] is-combining,
> width-if-non-combining} and best even some composition function.

But we have always been at war with EastAsia!

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@HIDDEN    http://cnswww.cns.cwru.edu/~chet/




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 10 Apr 2014 16:16:33 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 10 12:16:33 2014
Received: from localhost ([127.0.0.1]:39942 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WYHeS-0000qA-6P
	for submit <at> debbugs.gnu.org; Thu, 10 Apr 2014 12:16:32 -0400
Received: from forward10l.mail.yandex.net ([84.201.143.143]:56157)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <sdaoden@HIDDEN>) id 1WYHeO-0000pt-BG
 for 17196 <at> debbugs.gnu.org; Thu, 10 Apr 2014 12:16:30 -0400
Received: from smtp4o.mail.yandex.net (smtp4o.mail.yandex.net [37.140.190.29])
 by forward10l.mail.yandex.net (Yandex) with ESMTP id 69B9EBA0CBD;
 Thu, 10 Apr 2014 20:16:21 +0400 (MSK)
Received: from smtp4o.mail.yandex.net (localhost [127.0.0.1])
 by smtp4o.mail.yandex.net (Yandex) with ESMTP id 9260123216B2;
 Thu, 10 Apr 2014 20:16:20 +0400 (MSK)
Received: from unknown (unknown [89.204.139.192])
 by smtp4o.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id r76UwoaVWs-GIC4GxRE; 
 Thu, 10 Apr 2014 20:16:19 +0400
 (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits))
 (Client certificate not present)
X-Yandex-Uniq: 8f680bf8-3e00-4234-9a76-8cb266ba010c
Authentication-Results: smtp4o.mail.yandex.net; dkim=pass header.i=@yandex.com
Date: Thu, 10 Apr 2014 18:16:24 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: Rich Felker <dalias@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
 <20140410075610.GO26358@HIDDEN>
In-Reply-To: <20140410075610.GO26358@HIDDEN>
User-Agent: s-nail v14.6.4-1-ga39836e
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="=_01397146584=-hLlRJmGE22qxLp6/BPA3cw5+rq+yWU=_"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Eric Blake <eblake@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>,
 Austin Group <austin-group-l@HIDDEN>,
 =?utf-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

This is a multi-part message in MIME format.

--=_01397146584=-hLlRJmGE22qxLp6/BPA3cw5+rq+yWU=_
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Rich Felker <dalias@HIDDEN> wrote:
 |On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
 |> Eric Blake <eblake@HIDDEN> wrote:
 |>|Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |>|and currently states that %Ls is undefined.  But I would LOVE to have a
 |>|standardized spelling for counting characters instead of bytes.  The
 |>|extension %Ls looks like a good candidate for standardization, precisely
 |>|because counting characters when printing a multibyte string is more
 |>|useful than counting bytes (you do NOT want to end in the middle of a
 |>|multibyte character), and because ksh offers it as existing practice.
 |>|
 |>|Your idea for counting "cells" (by which I'm assuming you mean one or
 |>|more characters that all display within the same cell of the terminal,
 |>|as if the end user saw only one grapheme), on the other hand, does not
 |>|seem to have any precedent, and I would strongly object to having %s
 [.]
 |> I see you are trying to invent the word character for code points
 |> and reserve the term "grapheme" for user-perceived characters.
 |> This goes in line with the GNU library which has the existing
 |> practice to let wcwidth(3) return the value 1 for accents and
 |> other combining code points as well as so-called (Unicode)
 |> noncharacters.  And who would call wcwidth(3) on something that is
 |> not to be drawn onto the screen directly afterwards.  And, of
 |> course, which terminal will perform the composition of code points
 |> written via STD I/O to characters on its own.
 |> I think for quite a while it is up to the input methods to combine
 |> into something precomposed in order to let POSIX programs finally
 |> work with it.
 |
 |Many languages do not have precomposed forms for all the character
 |sequences they need, and for some, it would not even be practical to
 |have precomposed forms, and would force the use of complex input
 |methods instead of simple keyboard maps.

And of course with UTF-8, decomposed forms of characters from an
immense number of languages can occur, at least in theory, in,
e.g., a text file.
The German U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) could very
well be «ü» but also U+0075 U+0308 «u ̈», dependent on where it
came from.  And note that my vim(1) composed U+00FC when I tried
to input the latter string automatically; I had to separate, enter
each, and join them together to get at «u» plus, actually non-,
combining diaeresis.  (In fact actually «combining with a space».)
Of course a wcwidth(3) of 1 for U+0308 is much better than 0 when
it really produces something visual.

Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.
I don't think that «user-perceived characters don't have any
precedent».  A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.
If POSIX beams itself onto UTF-8 it should really consider to
offer a way to be able to act on what the user really deals with.
And that is, in the Unicode world -- and isn't that what the bug
report is about --, not necessarily a mbrlen(3)-division of bytes.

--steffen

--=_01397146584=-hLlRJmGE22qxLp6/BPA3cw5+rq+yWU=_
Content-Type: message/rfc822
Content-Disposition: inline
Content-Description: Original message content

Received: from mxfront3h.mail.yandex.net ([127.0.0.1])
	by mxfront3h.mail.yandex.net with LMTP id uF50cqbZ
	for <sdaoden@HIDDEN>; Thu, 10 Apr 2014 11:56:15 +0400
Received: from 216-12-86-13.cv.mvl.ntelos.net (216-12-86-13.cv.mvl.ntelos.net [216.12.86.13])
	by mxfront3h.mail.yandex.net (nwsmtp/Yandex) with SMTP id rYLYuCwMqF-uEAeEXKa;
	Thu, 10 Apr 2014 11:56:14 +0400
X-Yandex-Uniq: 655f0aa3-4efb-4152-ab26-2bb01fe7b98d
Received: from dalias by brightrain.aerifal.cx with local (Exim 3.15 #2)
	id 1WY9qE-0005Ha-00; Thu, 10 Apr 2014 07:56:10 +0000
Date: Thu, 10 Apr 2014 03:56:10 -0400
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Cc: Eric Blake <eblake@HIDDEN>, 17196 <at> debbugs.gnu.org,
	Austin Group <austin-group-l@HIDDEN>,
	Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>,
	=?utf-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140410075610.GO26358@HIDDEN>
References: <53408EFF.7050601@HIDDEN>
 <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN>
 <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
User-Agent: Mutt/1.5.21 (2010-09-15)
From: Rich Felker <dalias@HIDDEN>
Return-Path: dalias@HIDDEN
X-Yandex-Forward: 1431d05c8f532bcc8fea61a74badcb33
Status: RO

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <eblake@HIDDEN> wrote:
>  |>>   Dan Douglas wrote:
>  |>>> ksh93 already has this feature using the "L" modifier:
>  |>>> 
>  |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>  |>>> ★★★
>  |>>
>  |>> At least there is prior art for it.
>  |> 
>  |> So we can count bytes, chars or cells (graphemes).
>  |> 
>  |> Thinking a bit more about it, I think shell level printf
>  |> should be dealing in text of the current encoding and counting cells.
>  |> In the edge case where you want to deal in bytes one can do:
>  |>   LC_ALL=C printf ...
>  |> 
>  |> I see that ksh behaves as I would expect and counts cells,
>  |> though requires the explicit %L enabler:
>  |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★★
>  |>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★
>  |>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>  |>   Ａ
>  |> 
>  |> zsh seems to just count characters:
>  |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★★
>  |> 
>  |> I see that dash gives invalid directive for any of %ls %Ls %S.
>  |> 
>  |> Pity there is no consensus here.
>  |> Personally I would go for:
>  |>   printf '%3s' 'blah'  # count cells
>  |>   printf '%3Ls' 'blah' # count chars
>  |>   LANG=C '%3Ls' 'blah' # count bytes
>  |>   LANG=C '%3s' 'blah'  # count bytes
>  |
>  |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
>  |and currently states that %Ls is undefined.  But I would LOVE to have a
>  |standardized spelling for counting characters instead of bytes.  The
>  |extension %Ls looks like a good candidate for standardization, precisely
>  |because counting characters when printing a multibyte string is more
>  |useful than counting bytes (you do NOT want to end in the middle of a
>  |multibyte character), and because ksh offers it as existing practice.
>  |
>  |Your idea for counting "cells" (by which I'm assuming you mean one or
>  |more characters that all display within the same cell of the terminal,
>  |as if the end user saw only one grapheme), on the other hand, does not
>  |seem to have any precedent, and I would strongly object to having %s
>  |count by cells because %s already has a standardized (if unfortunate)
>  |meaning of counting by bytes.  Maybe yet another extension is warranted
>  |(perhaps %LLs?) as a new notion for counting by cells instead of
>  |characters, but it's harder to justify that without existing practice.
> 
> I see you are trying to invent the word character for code points
> and reserve the term "grapheme" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters.  And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards.  And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own.
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.

Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.

Rich


--=_01397146584=-hLlRJmGE22qxLp6/BPA3cw5+rq+yWU=_--
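
The tuple described above for an extended width function can be sketched
as follows.  This is only an illustration in Python: xywidth is the
hypothetical name used in the message, not an existing libc or POSIX
function, and the rules used here (Wide/Fullwidth code points take 2
cells, everything else 1, "combining" meaning a nonzero canonical
combining class) are assumptions, not what any wcwidth(3) does.

```python
import unicodedata


def xywidth(ch):
    """Hypothetical helper: describe one code point as a 3-tuple.

    Returns (east_asian_width, is_combining, width_if_non_combining).
    """
    # East Asian Width property: 'W', 'F', 'Na', 'N', 'A' or 'H'.
    eaw = unicodedata.east_asian_width(ch)
    # Assumption: a nonzero canonical combining class marks a combining mark.
    is_combining = unicodedata.combining(ch) != 0
    # Assumption: only Wide and Fullwidth code points occupy two cells.
    width = 2 if eaw in ("W", "F") else 1
    return eaw, is_combining, width


# Code points mentioned in this thread.
for ch in ("u", "\u0308", "\u00fc", "\uff21", "\u2605"):
    print(f"U+{ord(ch):04X}", xywidth(ch))
```

A caller that wants "cells" would add width only when is_combining is
false; a caller that draws a lone combining mark can still use the
third field.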




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 10 Apr 2014 07:56:24 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 10 03:56:24 2014
Received: from localhost ([127.0.0.1]:39544 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WY9qQ-0001JU-UV
	for submit <at> debbugs.gnu.org; Thu, 10 Apr 2014 03:56:23 -0400
Received: from 216-12-86-13.cv.mvl.ntelos.net ([216.12.86.13]:44012
 helo=brightrain.aerifal.cx) by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <dalias@HIDDEN>) id 1WY9qN-0001JE-H0
 for 17196 <at> debbugs.gnu.org; Thu, 10 Apr 2014 03:56:20 -0400
Received: from dalias by brightrain.aerifal.cx with local (Exim 3.15 #2)
 id 1WY9qE-0005Ha-00; Thu, 10 Apr 2014 07:56:10 +0000
Date: Thu, 10 Apr 2014 03:56:10 -0400
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140410075610.GO26358@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
 <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
User-Agent: Mutt/1.5.21 (2010-09-15)
From: Rich Felker <dalias@HIDDEN>
X-Spam-Score: 0.4 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Eric Blake <eblake@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>,
 Austin Group <austin-group-l@HIDDEN>,
 =?utf-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.4 (/)

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <eblake@HIDDEN> wrote:
>  |>>   Dan Douglas wrote:
>  |>>> ksh93 already has this feature using the "L" modifier:
>  |>>> 
>  |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>  |>>> ★★★
>  |>>
>  |>> At least there is prior art for it.
>  |> 
>  |> So we can count bytes, chars or cells (graphemes).
>  |> 
>  |> Thinking a bit more about it, I think shell level printf
>  |> should be dealing in text of the current encoding and counting cells.
>  |> In the edge case where you want to deal in bytes one can do:
>  |>   LC_ALL=C printf ...
>  |> 
>  |> I see that ksh behaves as I would expect and counts cells,
>  |> though requires the explicit %L enabler:
>  |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★★
>  |>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★
>  |>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>  |>   Ａ
>  |> 
>  |> zsh seems to just count characters:
>  |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★★
>  |> 
>  |> I see that dash gives invalid directive for any of %ls %Ls %S.
>  |> 
>  |> Pity there is no consensus here.
>  |> Personally I would go for:
>  |>   printf '%3s' 'blah'  # count cells
>  |>   printf '%3Ls' 'blah' # count chars
>  |>   LANG=C '%3Ls' 'blah' # count bytes
>  |>   LANG=C '%3s' 'blah'  # count bytes
>  |
>  |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
>  |and currently states that %Ls is undefined.  But I would LOVE to have a
>  |standardized spelling for counting characters instead of bytes.  The
>  |extension %Ls looks like a good candidate for standardization, precisely
>  |because counting characters when printing a multibyte string is more
>  |useful than counting bytes (you do NOT want to end in the middle of a
>  |multibyte character), and because ksh offers it as existing practice.
>  |
>  |Your idea for counting "cells" (by which I'm assuming you mean one or
>  |more characters that all display within the same cell of the terminal,
>  |as if the end user saw only one grapheme), on the other hand, does not
>  |seem to have any precedent, and I would strongly object to having %s
>  |count by cells because %s already has a standardized (if unfortunate)
>  |meaning of counting by bytes.  Maybe yet another extension is warranted
>  |(perhaps %LLs?) as a new notion for counting by cells instead of
>  |characters, but it's harder to justify that without existing practice.
> 
> I see you are trying to invent the word character for code points
> and reserve the term "grapheme" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters.  And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards.  And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own.
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.

Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.

Rich




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 9 Apr 2014 15:47:24 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Apr 09 11:47:24 2014
Received: from localhost ([127.0.0.1]:39200 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WXuig-0005dD-In
	for submit <at> debbugs.gnu.org; Wed, 09 Apr 2014 11:47:23 -0400
Received: from forward4l.mail.yandex.net ([84.201.143.137]:47067)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <sdaoden@HIDDEN>) id 1WXrwp-00010T-Jg
 for 17196 <at> debbugs.gnu.org; Wed, 09 Apr 2014 08:49:48 -0400
Received: from smtp1h.mail.yandex.net (smtp1h.mail.yandex.net [84.201.187.144])
 by forward4l.mail.yandex.net (Yandex) with ESMTP id A5BE81441127;
 Wed,  9 Apr 2014 16:49:39 +0400 (MSK)
Received: from smtp1h.mail.yandex.net (localhost [127.0.0.1])
 by smtp1h.mail.yandex.net (Yandex) with ESMTP id B63851340F6C;
 Wed,  9 Apr 2014 16:49:38 +0400 (MSK)
Received: from unknown (unknown [82.113.106.166])
 by smtp1h.mail.yandex.net (nwsmtp/Yandex) with ESMTPSA id WBC3dR9mYn-naD4f4Cf; 
 Wed,  9 Apr 2014 16:49:37 +0400
 (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits))
 (Client certificate not present)
X-Yandex-Uniq: a0da012c-a10d-40b9-bc00-e1c953c90020
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex.com; s=mail;
 t=1397047777; bh=CIw9qbQeAYBBogQWUXflnowQwxxIj8pfP4P/KUAWE2g=;
 h=Date:From:To:Cc:Subject:Message-ID:References:In-Reply-To:
 User-Agent:MIME-Version:Content-Type:Content-Transfer-Encoding;
 b=pBmJsEEBPEZU+9UxD87VZlFFasK4VUWKRpiwP1g+mz4W283R/aJarhrLG5STNS1TU
 GXyTUrA9CwVw5K6khosOA3krKyIWUPzmP7blmBxi0GdXWDhrk4gHU2gRXt5J7hz8ea
 qsq+F4t0/cVps584v90Jv8hDIyaPVLRzhhcpiaxc=
Authentication-Results: smtp1h.mail.yandex.net; dkim=pass header.i=@yandex.com
Date: Wed, 09 Apr 2014 14:49:37 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
In-Reply-To: <53431F2F.8060701@HIDDEN>
User-Agent: s-nail v14.6.4-1-ga39836e
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
X-Mailman-Approved-At: Wed, 09 Apr 2014 11:47:20 -0400
Cc: 17196 <at> debbugs.gnu.org, =?UTF-8?Q?P=C3=A1draig?= Brady <P@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Austin Group <austin-group-l@HIDDEN>,
 Jan Novak <jn@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

Eric Blake <eblake@HIDDEN> wrote:
 |>>   Dan Douglas wrote:
 |>>> ksh93 already has this feature using the "L" modifier:
 |>>>
 |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
 |>>> ★★★
 |>>
 |>> At least there is prior art for it.
 |>
 |> So we can count bytes, chars or cells (graphemes).
 |>
 |> Thinking a bit more about it, I think shell level printf
 |> should be dealing in text of the current encoding and counting cells.
 |> In the edge case where you want to deal in bytes one can do:
 |>   LC_ALL=C printf ...
 |>
 |> I see that ksh behaves as I would expect and counts cells,
 |> though requires the explicit %L enabler:
 |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★★
 |>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
 |>   Ａ★
 |>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
 |>   Ａ
 |>
 |> zsh seems to just count characters:
 |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★
 |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★
 |>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
 |>   Ａ★★
 |>
 |> I see that dash gives invalid directive for any of %ls %Ls %S.
 |>
 |> Pity there is no consensus here.
 |> Personally I would go for:
 |>   printf '%3s' 'blah'  # count cells
 |>   printf '%3Ls' 'blah' # count chars
 |>   LANG=C '%3Ls' 'blah' # count bytes
 |>   LANG=C '%3s' 'blah'  # count bytes
 |
 |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |and currently states that %Ls is undefined.  But I would LOVE to have a
 |standardized spelling for counting characters instead of bytes.  The
 |extension %Ls looks like a good candidate for standardization, precisely
 |because counting characters when printing a multibyte string is more
 |useful than counting bytes (you do NOT want to end in the middle of a
 |multibyte character), and because ksh offers it as existing practice.
 |
 |Your idea for counting "cells" (by which I'm assuming you mean one or
 |more characters that all display within the same cell of the terminal,
 |as if the end user saw only one grapheme), on the other hand, does not
 |seem to have any precedent, and I would strongly object to having %s
 |count by cells because %s already has a standardized (if unfortunate)
 |meaning of counting by bytes.  Maybe yet another extension is warranted
 |(perhaps %LLs?) as a new notion for counting by cells instead of
 |characters, but it's harder to justify that without existing practice.

I see you are trying to invent the word character for code points
and reserve the term "grapheme" for user-perceived characters.
This goes in line with the GNU library which has the existing
practice to let wcwidth(3) return the value 1 for accents and
other combining code points as well as so-called (Unicode)
noncharacters.  And who would call wcwidth(3) on something that is
not to be drawn onto the screen directly afterwards.  And, of
course, which terminal will perform the composition of code points
written via STD I/O to characters on its own.
I think for quite a while it is up to the input methods to combine
into something precomposed in order to let POSIX programs finally
work with it.

--steffen
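
For reference, the three units discussed in this thread (bytes, chars,
cells) give different totals for the sample strings quoted above.  A
minimal sketch in Python, assuming UTF-8 as the encoding, one "char"
per code point, and a simplified cell rule (combining marks take 0
cells, Wide/Fullwidth code points take 2, everything else 1); cell_width
and counts are illustrative names, and real wcwidth(3) implementations
may disagree with this rule:

```python
import unicodedata


def cell_width(ch):
    """Approximate terminal cells for one code point (assumed rule)."""
    if unicodedata.combining(ch):
        return 0  # combining mark: drawn on top of the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide or Fullwidth: two cells
    return 1


def counts(s):
    """Return the length of s in bytes (UTF-8), code points, and cells."""
    return {
        "bytes": len(s.encode("utf-8")),
        "chars": len(s),
        "cells": sum(cell_width(ch) for ch in s),
    }


# Strings taken from the ksh/zsh examples in this thread.
samples = [
    "\u2605" * 5,                    # five BLACK STARs
    "a\u0301\u2605\u2605\u2605",     # a + COMBINING ACUTE ACCENT + stars
    "\uff21\u2605\u2605\u2605",      # FULLWIDTH LATIN CAPITAL LETTER A + stars
]
for s in samples:
    print(repr(s), counts(s))
```

The second and third samples are the interesting ones: the combining
accent adds a char but no cell, and the fullwidth letter adds one char
but two cells.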




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 8 Apr 2014 01:28:18 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Apr 07 21:28:18 2014
Received: from localhost ([127.0.0.1]:40037 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WXKpl-0004I6-Kj
	for submit <at> debbugs.gnu.org; Mon, 07 Apr 2014 21:28:18 -0400
Received: from mx1.redhat.com ([209.132.183.28]:19405)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eblake@HIDDEN>) id 1WXKph-0004Hu-Rm
 for 17196 <at> debbugs.gnu.org; Mon, 07 Apr 2014 21:28:15 -0400
Received: from int-mx13.intmail.prod.int.phx2.redhat.com
 (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26])
 by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s381SBwj003254
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Mon, 7 Apr 2014 21:28:12 -0400
Received: from [10.3.113.181] (ovpn-113-181.phx2.redhat.com [10.3.113.181])
 by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id
 s381SAUH012178; Mon, 7 Apr 2014 21:28:10 -0400
Message-ID: <534350AA.2050803@HIDDEN>
Date: Mon, 07 Apr 2014 19:28:10 -0600
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat, Inc.
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN>
 <53412952.1040506@HIDDEN>	<20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
 <53433EA1.4010204@HIDDEN>
In-Reply-To: <53433EA1.4010204@HIDDEN>
X-Enigmail-Version: 1.6
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="WKOPh9dwtCRoU95obFdvxAM1SCC5dAhQe"
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26
X-Spam-Score: -5.3 (-----)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.3 (-----)

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--WKOPh9dwtCRoU95obFdvxAM1SCC5dAhQe
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 04/07/2014 06:11 PM, Pádraig Brady wrote:

>
> If we had to make it explicit for backwards compat reasons,
> then I suppose counting by characters is the least useful,
> so we could just standardize the existing ksh behavior and have:
>
>    printf '%3s' 'blah'  # count bytes
>    printf '%3Ls' 'blah' # count cells
>    LANG=C '%3Ls' 'blah' # count bytes

If we add %3Ls to the shell, we should also add it to libc's printf(3),
which means coordinating with the C committee.

>
> This has the disadvantage of not degrading gracefully
> on dash for example where %Ls is rejected.

If a future version of the standard mandates behavior for %Ls, I suspect
dash would be made compliant fairly quickly - the dash maintainers
strive hard to comply with POSIX.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--WKOPh9dwtCRoU95obFdvxAM1SCC5dAhQe
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBCAAGBQJTQ1CqAAoJEKeha0olJ0Nq1fMH/iocyOefBelzJjRFQe9OpSZH
U4Od8i/T8FNt+2kaUbaYud8Hq7hlciSdp1vbB1GFur89qQ9hH5fzvQMEdZyhaazx
Rurfq8nT1hBjUkNbbb60TYovJY71Pqkmuop32BrmpwYNoM/K2cthcHD9RO7djXQ0
lN/zAEFtrs7/ETJT2/FrieIBci98bCjggEMQ15rbkpTPZ6sWJLk03aHqpDZKQ/+j
8GD7fZJwCKWV4g3Rn13Qc+enT9Wnxx1L5Y+6P5fGbx7pxPD6mK3pUmyCewwjFong
iKM9H7fb2iUaWphMlefooeWhnvtvb38E9Srm78N0ZQsIH/iMbTknOfT07I5mw48=
=XKN5
-----END PGP SIGNATURE-----

--WKOPh9dwtCRoU95obFdvxAM1SCC5dAhQe--
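
The precision semantics under discussion (count bytes, chars, or cells)
can be compared side by side with a small sketch.  This is Python, not
the printf(1) or printf(3) implementation; truncate and cell_width are
illustrative names, the cell rule (combining = 0, Wide/Fullwidth = 2,
otherwise 1) is an assumption, and the "bytes" mode here deliberately
stops at a character boundary, which a byte-counting %s does not
promise.

```python
import unicodedata


def cell_width(ch):
    """Approximate terminal cells for one code point (assumed rule)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def truncate(s, limit, unit):
    """Keep the longest prefix of s costing at most `limit` units.

    unit is "bytes" (UTF-8), "chars" (code points) or "cells".
    Whole code points only: a character is never split.
    """
    out = []
    used = 0
    for ch in s:
        if unit == "bytes":
            cost = len(ch.encode("utf-8"))
        elif unit == "chars":
            cost = 1
        elif unit == "cells":
            cost = cell_width(ch)
        else:
            raise ValueError(f"unknown unit: {unit}")
        if used + cost > limit:
            break
        out.append(ch)
        used += cost
    return "".join(out)


accented = "a\u0301\u2605\u2605\u2605"   # a + combining acute + stars
fullwidth = "\uff21\u2605\u2605\u2605"   # fullwidth A + stars
for unit in ("bytes", "chars", "cells"):
    print(unit, truncate(accented, 3, unit), truncate(fullwidth, 3, unit))
```

With a limit of 3, the "cells" mode yields the same prefixes as the ksh
%.3Ls output quoted in this thread, and the "chars" mode yields the
same prefixes as the zsh output.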




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 8 Apr 2014 00:11:23 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Apr 07 20:11:23 2014
Received: from localhost ([127.0.0.1]:40018 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WXJdG-0002J3-TE
	for submit <at> debbugs.gnu.org; Mon, 07 Apr 2014 20:11:23 -0400
Received: from mail2.vodafone.ie ([213.233.128.44]:3379)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <P@HIDDEN>) id 1WXJdE-0002Iu-KO
 for 17196 <at> debbugs.gnu.org; Mon, 07 Apr 2014 20:11:17 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApUBAHY9Q1NtTJL0/2dsb2JhbAANTINBg2G5bIc3gT2DGQEBAQMBAQIgDwFGBQsJAg0BCgICBRYLAgIJAwIBAgEWLwYNAQcBAYdtDQiMc5sidqIwF4EpjUgHgm+BSQEDlgSEC4VFjnc
Received: from unknown (HELO [192.168.1.79]) ([109.76.146.244])
 by mail2.vodafone.ie with ESMTP; 08 Apr 2014 01:11:14 +0100
Message-ID: <53433EA1.4010204@HIDDEN>
Date: Tue, 08 Apr 2014 01:11:13 +0100
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN>
 <53412952.1040506@HIDDEN>	<20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
In-Reply-To: <53431F2F.8060701@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>,
 Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
> 
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>>   Dan Douglas wrote:
>>>   > ksh93 already has this feature using the "L" modifier:
>>>   > 
>>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>>   > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★
>>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>>   Ａ
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah'  # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C printf '%3Ls' 'blah' # count bytes
>>   LANG=C printf '%3s' 'blah'  # count bytes
> 
> Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined.  But I would LOVE to have a
> standardized spelling for counting characters instead of bytes.  The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note that ksh seems to count cells, not characters, with %Ls.

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes.  Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority
of uses would prefer to specify cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).

But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

   printf '%3s' 'blah'  # count bytes
   printf '%3Ls' 'blah' # count cells
   LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.

thanks,
Pádraig.




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 7 Apr 2014 21:57:11 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Apr 07 17:57:11 2014
Received: from localhost ([127.0.0.1]:39976 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WXHXS-0007Jn-5M
	for submit <at> debbugs.gnu.org; Mon, 07 Apr 2014 17:57:10 -0400
Received: from mx1.redhat.com ([209.132.183.28]:50461)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eblake@HIDDEN>) id 1WXHXO-0007Jd-Is
 for 17196 <at> debbugs.gnu.org; Mon, 07 Apr 2014 17:57:08 -0400
Received: from int-mx02.intmail.prod.int.phx2.redhat.com
 (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])
 by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s37Lv4Hw005827
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Mon, 7 Apr 2014 17:57:05 -0400
Received: from [10.3.113.181] (ovpn-113-181.phx2.redhat.com [10.3.113.181])
 by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id
 s37Lv3Y2001250; Mon, 7 Apr 2014 17:57:04 -0400
Message-ID: <53431F2F.8060701@HIDDEN>
Date: Mon, 07 Apr 2014 15:57:03 -0600
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat, Inc.
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>,
 Bob Proulx <bob@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN>
 <53412952.1040506@HIDDEN>	<20140406182447.GA1381@HIDDEN>
 <5342A337.9000407@HIDDEN>
In-Reply-To: <5342A337.9000407@HIDDEN>
X-Enigmail-Version: 1.6
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="IT8uTs5CGj9Cnq7XtFEWmDVt7rWHNjtJ8"
X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12
X-Spam-Score: -5.3 (-----)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>,
 Jan Novak <jn@HIDDEN>

[adding the Austin Group]

On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>>   Dan Douglas wrote:
>>   > ksh93 already has this feature using the "L" modifier:
>>   > 
>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>   > ★★★
>>
>> At least there is prior art for it.
> 
> So we can count bytes, chars or cells (graphemes).
> 
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
>   LC_ALL=C printf ...
> 
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★★
>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>   Ａ★
>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>   Ａ
> 
> zsh seems to just count characters:
>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>   Ａ★★
> 
> I see that dash gives invalid directive for any of %ls %Ls %S.
> 
> Pity there is no consensus here.
> Personally I would go for:
>   printf '%3s' 'blah'  # count cells
>   printf '%3Ls' 'blah' # count chars
>   LANG=C printf '%3Ls' 'blah' # count bytes
>   LANG=C printf '%3s' 'blah'  # count bytes

Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined.  But I would LOVE to have a
standardized spelling for counting characters instead of bytes.  The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.

Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes.  Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 7 Apr 2014 21:41:11 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Apr 07 17:41:11 2014
Received: from localhost ([127.0.0.1]:39963 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WXHHy-0006uX-Th
	for submit <at> debbugs.gnu.org; Mon, 07 Apr 2014 17:41:11 -0400
Received: from smtp1.gts.sk ([195.168.0.153]:49961 helo=smtp5.gts.sk)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <jn@HIDDEN>) id 1WXHHv-0006uJ-H1
 for 17196 <at> debbugs.gnu.org; Mon, 07 Apr 2014 17:41:08 -0400
Received: from localhost (localhost [127.0.0.1])
 by smtp5.gts.sk (Postfix) with ESMTP id EBF68E805D;
 Mon,  7 Apr 2014 23:41:05 +0200 (CEST)
X-Virus-Scanned: amavisd-new at nextra.sk
Received: from smtp5.gts.sk ([195.168.0.153])
 by localhost (smtp.gts.sk [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 9YEBQzx29SJh; Mon,  7 Apr 2014 23:41:04 +0200 (CEST)
Received: from [10.1.2.4] (188-167-225-220.dynamic.chello.sk [188.167.225.220])
 (using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: nkame@HIDDEN)
 by smtp5.gts.sk (Postfix) with ESMTPSA id 352F0E8006;
 Mon,  7 Apr 2014 23:41:04 +0200 (CEST)
Message-ID: <53431B6F.1040108@HIDDEN>
Date: Mon, 07 Apr 2014 23:41:03 +0200
From: Jan Novak <jn@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>, 
 Bob Proulx <bob@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN>
In-Reply-To: <5342A337.9000407@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org

Pádraig Brady wrote:
> Pity there is no consensus here.
> Personally I would go for:
>    printf '%3s' 'blah'  # count cells
>    printf '%3Ls' 'blah' # count chars
>    LANG=C printf '%3Ls' 'blah' # count bytes
>    LANG=C printf '%3s' 'blah'  # count bytes

I vote for it.
It is an excellent idea that the "standard" notation works properly in a localized environment
(because this is exactly what users expect).

Thanks !
novak




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 7 Apr 2014 13:08:13 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Apr 07 09:08:13 2014
Received: from localhost ([127.0.0.1]:38921 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WX9HY-0000N1-MP
	for submit <at> debbugs.gnu.org; Mon, 07 Apr 2014 09:08:13 -0400
Received: from mail2.vodafone.ie ([213.233.128.44]:10186)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <P@HIDDEN>) id 1WX9HV-0000Mp-MS
 for 17196 <at> debbugs.gnu.org; Mon, 07 Apr 2014 09:08:10 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApUBAFCiQlNtTJL0/2dsb2JhbAANTINBg2G5WYc3gTeDGQEBAQQBAiAPAUYQCQINCwICBRYLAgIJAwIBAgEWLwYNAQcBAYd6CI0JmyJ2oiAXgSmNSAeCb4FJAQOWBIQLhUWOdw
Received: from unknown (HELO [192.168.1.79]) ([109.76.146.244])
 by mail2.vodafone.ie with ESMTP; 07 Apr 2014 14:08:08 +0100
Message-ID: <5342A337.9000407@HIDDEN>
Date: Mon, 07 Apr 2014 14:08:07 +0100
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Bob Proulx <bob@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
 <20140406182447.GA1381@HIDDEN>
In-Reply-To: <20140406182447.GA1381@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org, Jan Novak <jn@HIDDEN>

On 04/06/2014 07:24 PM, Bob Proulx wrote:
> Pádraig Brady wrote:
>> Yes printf follows the C standard which only considers bytes.
>> ...
>> I don't think we'd be able to change the current operation of printf
>> due to backwards compat reasons? Though we might be able to somehow leverage
>> the existing multibyte character aware alignment/truncation code in:
>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
> 
> Dan Douglas pointed out in the corresponding discussion in bug-bash
> that ksh uses the L modifier.
> 
>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
> 
>   Dan Douglas wrote:
>   > ksh93 already has this feature using the "L" modifier:
>   > 
>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>   > ★★★
> 
> At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).

Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
In the edge case where you want to deal in bytes one can do:
  LC_ALL=C printf ...

I see that ksh behaves as I would expect and counts cells,
though requires the explicit %L enabler:
  $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★★
  $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
  Ａ★
  $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
  Ａ

zsh seems to just count characters:
  $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
  Ａ★★

I see that dash gives invalid directive for any of %ls %Ls %S.
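
Because dash rejects these directives outright, truncation by character can
also be approximated outside printf. A minimal sketch, assuming valid UTF-8
input without embedded newlines and a sed that treats bytes as characters
under LC_ALL=C; utf8_trunc is a hypothetical helper name, not an existing
utility:

```shell
#!/bin/sh
# utf8_trunc N STRING: print at most the first N UTF-8 characters of STRING.
# (Hypothetical helper for illustration; assumes STRING is valid UTF-8.)
utf8_trunc() {
  # Bracket-expression range covering UTF-8 continuation bytes 0x80-0xBF.
  cont=$(printf '\200-\277')
  # One character = one non-continuation byte plus its continuation bytes.
  printf '%s\n' "$2" |
    LC_ALL=C sed "s/^\(\([^$cont][$cont]*\)\{0,$1\}\).*\$/\1/"
}

# Five U+2605 stars, built from octal escapes so the script is ASCII only.
star=$(printf '\342\230\205')
utf8_trunc 3 "$star$star$star$star$star"
```

This counts characters as zsh does above, not cells as ksh does, so a
combining mark or a double-width character each counts as one.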

Pity there is no consensus here.
Personally I would go for:
  printf '%3s' 'blah'  # count cells
  printf '%3Ls' 'blah' # count chars
  LANG=C printf '%3Ls' 'blah' # count bytes
  LANG=C printf '%3s' 'blah'  # count bytes

Pádraig.





Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 6 Apr 2014 18:24:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Apr 06 14:24:53 2014
Received: from localhost ([127.0.0.1]:38329 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WWrkS-0007s0-9M
	for submit <at> debbugs.gnu.org; Sun, 06 Apr 2014 14:24:52 -0400
Received: from joseki.proulx.com ([216.17.153.58]:48570)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <bob@HIDDEN>) id 1WWrkO-0007rk-QD
 for 17196 <at> debbugs.gnu.org; Sun, 06 Apr 2014 14:24:50 -0400
Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119])
 by joseki.proulx.com (Postfix) with ESMTP id 9224721233;
 Sun,  6 Apr 2014 12:24:47 -0600 (MDT)
Received: by hysteria.proulx.com (Postfix, from userid 1000)
 id 62F292DC9A; Sun,  6 Apr 2014 12:24:47 -0600 (MDT)
Date: Sun, 6 Apr 2014 12:24:47 -0600
From: Bob Proulx <bob@HIDDEN>
To: 17196 <at> debbugs.gnu.org
Subject: Re: bug#17196: UTF-8 printf string formating  problem
Message-ID: <20140406182447.GA1381@HIDDEN>
References: <53408EFF.7050601@HIDDEN>
 <53412952.1040506@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <53412952.1040506@HIDDEN>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Score: -0.3 (/)
X-Debbugs-Envelope-To: 17196
Cc: Jan Novak <jn@HIDDEN>

Pádraig Brady wrote:
> Yes printf follows the C standard which only considers bytes.
> ...
> I don't think we'd be able to change the current operation of printf
> due to backwards compat reasons? Though we might be able to somehow leverage
> the existing multibyte character aware alignment/truncation code in:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.

  http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html

  Dan Douglas wrote:
  > ksh93 already has this feature using the "L" modifier:
  > 
  > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
  > ★★★

At least there is prior art for it.

Bob




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 6 Apr 2014 18:13:26 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Apr 06 14:13:26 2014
Received: from localhost ([127.0.0.1]:38323 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WWrZN-0007Yy-TT
	for submit <at> debbugs.gnu.org; Sun, 06 Apr 2014 14:13:26 -0400
Received: from mail1.vodafone.ie ([213.233.128.43]:63816)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <P@HIDDEN>) id 1WWrZL-0007Yl-De
 for 17196 <at> debbugs.gnu.org; Sun, 06 Apr 2014 14:13:24 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApQBALyYQVNtT6Td/2dsb2JhbAANS4civX+DDoErgxkBAQEEIw8BRhALDQEKAgIFFgsCAgkDAgECAUUGDQEHAQEXh2OoSXaiFReBKY1IB4JvgUkBA59Ujnc
Received: from unknown (HELO [192.168.1.79]) ([109.79.164.221])
 by mail1.vodafone.ie with ESMTP; 06 Apr 2014 19:13:21 +0100
Message-ID: <53419941.7090105@HIDDEN>
Date: Sun, 06 Apr 2014 19:13:21 +0100
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN>
In-Reply-To: <53412952.1040506@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org

On 04/06/2014 11:15 AM, Pádraig Brady wrote:
> On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>>
>> $ printf "|%3s|\n" "a"
>> |  a|
>>
>> $ printf "|%3s|\n" "á"     (char is a-acute)
>> | á|
>>
>> expected output:
>> |  á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
> 
> Yes printf follows the C standard which only considers bytes.
> awk does respect characters in width specifiers though:
> 
>   $ awk 'BEGIN{printf "|%3s|\n", "á"}'
>   |  á|

Jan points out to me that the awk solution is not portable
to mawk 1.3.3 at least. I used GNU Awk 3.1.8 above.
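
A workaround that does not depend on any particular awk is to compute the
padding separately. A minimal sketch, assuming valid UTF-8 input; pad_left
is a hypothetical helper name, and it counts characters, not display cells:

```shell
#!/bin/sh
# pad_left WIDTH STRING: right-align STRING in a field of WIDTH characters,
# counting UTF-8 characters rather than bytes.
# (Hypothetical helper for illustration; assumes STRING is valid UTF-8.)
pad_left() {
  # Characters = bytes that are not continuation bytes (0x80-0xBF).
  chars=$(printf '%s' "$2" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')
  pad=$(( $1 - chars ))
  if [ "$pad" -gt 0 ]; then
    printf "%${pad}s" ''
  fi
  printf '%s' "$2"
}

# U+00E1 (a-acute) built from octal escapes so the script is ASCII only.
aacute=$(printf '\303\241')
printf '|'; pad_left 3 "$aacute"; printf '|\n'
```

Combining marks and double-width characters would still misalign, since
each is counted as exactly one column here.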

Pádraig.





Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at 17196 <at> debbugs.gnu.org:


Received: (at 17196) by debbugs.gnu.org; 6 Apr 2014 10:15:50 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Apr 06 06:15:50 2014
Received: from localhost ([127.0.0.1]:37447 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WWk7C-0007Rr-2C
	for submit <at> debbugs.gnu.org; Sun, 06 Apr 2014 06:15:50 -0400
Received: from mail1.vodafone.ie ([213.233.128.43]:17840)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <P@HIDDEN>) id 1WWk79-0007Rf-Q1
 for 17196 <at> debbugs.gnu.org; Sun, 06 Apr 2014 06:15:48 -0400
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApQBAEUnQVNtT6Td/2dsb2JhbAANS4NBg2HBBoErgxkBAQEEIw8BRhALDQEKAgIFFgsCAgkDAgECAUUGDQEHAQEXh2MIqg12oXoXgSmNSAeCb4FJAQOfVI53
Received: from unknown (HELO [192.168.1.79]) ([109.79.164.221])
 by mail1.vodafone.ie with ESMTP; 06 Apr 2014 11:15:45 +0100
Message-ID: <53412952.1040506@HIDDEN>
Date: Sun, 06 Apr 2014 11:15:46 +0100
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= <P@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating  problem
References: <53408EFF.7050601@HIDDEN>
In-Reply-To: <53408EFF.7050601@HIDDEN>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 17196
Cc: 17196 <at> debbugs.gnu.org

On 04/06/2014 12:17 AM, Jan Novak wrote:
> Hello,
> 
> printf string format counts bytes instead of chars, which leads to broken output ...
> (the same problem occurs with bash built in printf)
> 
> 
> just try this:
> 
> $ echo $LANG
> us_US.UTF-8
> 
> 
> $ printf "|%3s|\n" "a"
> |  a|
> 
> $ printf "|%3s|\n" "á"     (char is a-acute)
> | á|
> 
> expected output:
> |  á|
> 
> Is there some easy solution ?
> 
> TIA for the answer

Yes printf follows the C standard which only considers bytes.
awk does respect characters in width specifiers though:

  $ awk 'BEGIN{printf "|%3s|\n", "á"}'
  |  á|
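
The one-column shortfall in the report follows from the encoding: á occupies
two bytes in UTF-8, so a byte-counting %3s adds only one space. A small
sketch showing both counts, assuming valid UTF-8 input (the helper names are
hypothetical):

```shell
#!/bin/sh
# utf8_bytes STRING: number of bytes in STRING.
utf8_bytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# utf8_chars STRING: number of UTF-8 characters in STRING, counted as the
# bytes that are not continuation bytes (0x80-0xBF). Assumes valid UTF-8.
utf8_chars() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# U+00E1 (a-acute) built from octal escapes so the script is ASCII only.
aacute=$(printf '\303\241')
printf 'bytes=%s chars=%s\n' "$(utf8_bytes "$aacute")" "$(utf8_chars "$aacute")"
# prints: bytes=2 chars=1
```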

I don't think we'd be able to change the current operation of printf
due to backwards compat reasons? Though we might be able to somehow leverage
the existing multibyte character aware alignment/truncation code in:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

thanks,
Pádraig.




Information forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 5 Apr 2014 23:21:34 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Apr 05 19:21:34 2014
Received: from localhost ([127.0.0.1]:37178 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WWZu1-0003Dh-7N
	for submit <at> debbugs.gnu.org; Sat, 05 Apr 2014 19:21:33 -0400
Received: from eggs.gnu.org ([208.118.235.92]:40757)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <jn@HIDDEN>) id 1WWZqP-000375-Bi
 for submit <at> debbugs.gnu.org; Sat, 05 Apr 2014 19:17:49 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <jn@HIDDEN>) id 1WWZqF-00042O-7k
 for submit <at> debbugs.gnu.org; Sat, 05 Apr 2014 19:17:49 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: *
X-Spam-Status: No, score=1.3 required=5.0 tests=BAYES_40,
 RCVD_IN_BL_SPAMCOP_NET autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:57101)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <jn@HIDDEN>)
 id 1WWZqF-00042K-4T
 for submit <at> debbugs.gnu.org; Sat, 05 Apr 2014 19:17:39 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42472)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <jn@HIDDEN>)
 id 1WWZq7-0001j1-KZ
 for bug-coreutils@HIDDEN; Sat, 05 Apr 2014 19:17:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <jn@HIDDEN>) id 1WWZq0-00041i-6a
 for bug-coreutils@HIDDEN; Sat, 05 Apr 2014 19:17:31 -0400
Received: from smtp1.gts.sk ([195.168.0.153]:52608 helo=smtp5.gts.sk)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <jn@HIDDEN>)
 id 1WWZpz-00041S-VS
 for bug-coreutils@HIDDEN; Sat, 05 Apr 2014 19:17:24 -0400
Received: from localhost (localhost [127.0.0.1])
 by smtp5.gts.sk (Postfix) with ESMTP id E9920E8069
 for <bug-coreutils@HIDDEN>; Sun,  6 Apr 2014 01:17:20 +0200 (CEST)
X-Virus-Scanned: amavisd-new at nextra.sk
Received: from smtp5.gts.sk ([195.168.0.153])
 by localhost (smtp.gts.sk [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id FCLwGCwX2sYd for <bug-coreutils@HIDDEN>;
 Sun,  6 Apr 2014 01:17:19 +0200 (CEST)
Received: from [10.1.2.4] (188-167-225-220.dynamic.chello.sk [188.167.225.220])
 (using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: nkame@HIDDEN)
 by smtp5.gts.sk (Postfix) with ESMTPSA id 6C90DE807B
 for <bug-coreutils@HIDDEN>; Sun,  6 Apr 2014 01:17:19 +0200 (CEST)
Message-ID: <53408EFF.7050601@HIDDEN>
Date: Sun, 06 Apr 2014 01:17:19 +0200
From: Jan Novak <jn@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux i686;
 rv:24.0) Gecko/20100101 Thunderbird/24.4.0
MIME-Version: 1.0
To: bug-coreutils@HIDDEN
Subject: UTF-8 printf string formating  problem
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -2.8 (--)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Sat, 05 Apr 2014 19:21:31 -0400

Hello,

printf string format counts bytes instead of chars, which leads to broken output ...
(the same problem occurs with bash built in printf)


just try this:

$ echo $LANG
us_US.UTF-8


$ printf "|%3s|\n" "a"
|  a|

$ printf "|%3s|\n" "á"     (char is a-acute)
| á|

expected output:
|  á|

Is there some easy solution ?

TIA for the answer


Best regards
Novak




Acknowledgement sent to Jan Novak <jn@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to bug-coreutils@HIDDEN:
bug#17196; Package coreutils. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.