Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Message-ID: <1399601805.73330.YahooMailNeo@HIDDEN>
Date: Thu, 8 May 2014 19:16:45 -0700 (PDT)
From: Leslie S Satenstein <lsatenstein@HIDDEN>
To: Jan Novak <jn@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <53412952.1040506@HIDDEN>

Perhaps printf() needs some wide character extensions via %new characters

Regards

Leslie

Mr. Leslie Satenstein
SENT FROM MY OPEN SOURCE LINUX SYSTEM.

> From: Pádraig Brady <P@HIDDEN>
> To: Jan Novak <jn@HIDDEN>
> Cc: 17196 <at> debbugs.gnu.org
> Sent: Sunday, April 6, 2014 6:15 AM
> Subject: bug#17196: UTF-8 printf string formating problem
>
> On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>> $ printf "|%3s|\n" "a"
>> |  a|
>>
>> $ printf "|%3s|\n" "á"    (char is a-acute)
>> | á|
>>
>> expected output:
>> |  á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
>
> Yes printf follows the C standard which only considers bytes.
> awk does respect characters in width specifiers though:
>
>   $ awk 'BEGIN{printf "|%3s|\n", "á"}'
>   |  á|
>
> I don't think we'd be able to change the current operation of printf
> due to backwards compat reasons? Though we might be able to somehow leverage
> the existing multibyte character aware alignment/truncation code in:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>
> thanks,
> Pádraig.

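The difference reported above (width counted in bytes versus in characters) can be reproduced outside the shell. The following is only a minimal sketch in Python, not the coreutils or awk implementation; the helper names `pad_bytes` and `pad_chars` are hypothetical and exist purely for illustration.

```python
def pad_bytes(s: str, width: int, encoding: str = "utf-8") -> bytes:
    """Right-justify the way a byte-counting printf "%3s" does:
    the field width is compared against the number of encoded bytes."""
    raw = s.encode(encoding)
    return b" " * max(0, width - len(raw)) + raw


def pad_chars(s: str, width: int) -> str:
    """Right-justify counting code points instead of bytes, which is
    the behaviour the reporter expected."""
    return " " * max(0, width - len(s)) + s
```

With a width of 3, `pad_bytes("á", 3)` adds a single space because "á" already occupies two bytes in UTF-8, while `pad_chars("á", 3)` adds two spaces. Neither helper accounts for combining marks or double-width characters, which is the further distinction raised later in the thread.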
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
Message-ID: <20140411144041.KmpeitNBK3J2xP1tlaqPyJ+P@HIDDEN>
Date: Fri, 11 Apr 2014 15:40:41 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: chet.ramey@HIDDEN
Cc: 17196 <at> debbugs.gnu.org, Pádraig Brady <P@HIDDEN>, Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <5347DF27.50702@HIDDEN>

Chet Ramey <chet.ramey@HIDDEN> wrote:
 |On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
 |> Hello,
 |>
 |> Chet Ramey <chet.ramey@HIDDEN> wrote:
 |>|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
 |>|
 |>|> Even better would nonetheless be the great picture with
 |>|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
 |>|> a tuple of {[EastAsianWidth indication,] is-combining,
 |>|> width-if-non-combining} and best even some composition function.
 |>|
 |>|But we have always been at war with EastAsia!
 |>
 |> I see you really would love to get a hand from POSIX too:
 |
 |I'm sorry, I realize that was rather obscure.  It's from "1984", by George
 |Orwell.  It's a central theme to the book.  The quote was an attempt to

oh, ah, yes.  So.. i got it right without getting it right.
Interestingly, yesterday started a retrospective work on Walter
Benjamin (<http://www.eingedenken.de/enter.html> -- "remembrance"):
an artist (Christoph Korn) walked his last trip from
Banyuls-sur-Mer (France) to Portbou (Spain; where he committed
suicide due to the impossibility to reach the U.S.), following
a fixed time frame (monotonic tick, so to say) after which he
spoke theses of Benjamin (like, e.g., "There is no document of
civilization which is not at the same time a document of
barbarism."), followed by pausing and taking a (steady cam)
video of the most recent leg.
Association with Paul Klee's "Angelus Novus" is desired (from both
parties).

 |inject levity into the discussion.

That was easy.

--steffen

Message-ID: <5347DF27.50702@HIDDEN>
Date: Fri, 11 Apr 2014 08:25:11 -0400
From: Chet Ramey <chet.ramey@HIDDEN>
Reply-To: chet.ramey@HIDDEN
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, chet.ramey@HIDDEN, Pádraig Brady <P@HIDDEN>, Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>

On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
> Hello,
>
> Chet Ramey <chet.ramey@HIDDEN> wrote:
>  |On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
>  |
>  |> Even better would nonetheless be the great picture with
>  |> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
>  |> a tuple of {[EastAsianWidth indication,] is-combining,
>  |> width-if-non-combining} and best even some composition function.
>  |
>  |But we have always been at war with EastAsia!
>
> I see you really would love to get a hand from POSIX too:

I'm sorry, I realize that was rather obscure.  It's from "1984", by George
Orwell.  It's a central theme to the book.  The quote was an attempt to
inject levity into the discussion.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@HIDDEN    http://cnswww.cns.cwru.edu/~chet/

Message-ID: <20140411111615.ho9kmtrCAOTLmdWnrbsIp1DI@HIDDEN>
Date: Fri, 11 Apr 2014 12:16:15 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: chet.ramey@HIDDEN
Cc: 17196 <at> debbugs.gnu.org, Pádraig Brady <P@HIDDEN>, Rich Felker <dalias@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <5346DE92.9020004@HIDDEN>

Hello,

Chet Ramey <chet.ramey@HIDDEN> wrote:
 |On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
 |
 |> Even better would nonetheless be the great picture with
 |> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
 |> a tuple of {[EastAsianWidth indication,] is-combining,
 |> width-if-non-combining} and best even some composition function.
 |
 |But we have always been at war with EastAsia!

I see you really would love to get a hand from POSIX too:

  ?0[steffen@sherwood bash-4.3]$ grep -r UNICODE_COMB .
  ./lib/readline/display.c:      if (t > 0 && UNICODE_COMBINING_CHAR (wc) && WCWIDTH (wc) == 0)
  ./lib/readline/rlmbutil.h:#define UNICODE_COMBINING_CHAR(x) ((x) >= 768 && (x) <= 879)
  ./lib/readline/rlmbutil.h:# define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : wcwidth(wc))

And sorry for not making this clear for those who never dealt with
the problem (which is probably not uncommon for filesystem or
other kernel hackers): `EastAsianWidth' refers to a property of
Unicode and ISO 10646:

  # EastAsianWidth-6.3.0.txt
  # Date: 2013-02-05, 20:09:00 GMT [KW, LI]
  #
  # East Asian Width Properties
  #
  # This file is an informative contributory data file in the
  # Unicode Character Database.
  #
  # Copyright (c) 1991-2013 Unicode, Inc.
  # For terms of use, see http://www.unicode.org/terms_of_use.html

--steffen

...
To be honest i must admit i first was pissed, so let me append the
original first part of this message, please:

  and so the landslide had brought it down.
  But i would quote Paul Vixie, who stated in a message today

    gentlemen and ladies, we have met the enemy, and they are our egos.

    vixie

  From my point of view it's the matter of culture and philosophy
  (including religion) how to deal with that very problem.
  And i can assure you that Jehovah's Witnesses, who have visited me
  regularly for some years now, like to drink a bit of my Buddhistic
  tea.

  Paul Vixie is correct.
  I am stupid.
  With greetings from someone who will undergo his 42nd birthday soon

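The two readline macros found by the grep in the message above special-case a single block of combining marks before falling back to wcwidth(3). The following Python sketch mirrors that logic for illustration only. The function names are hypothetical, and the fallback based on `unicodedata.east_asian_width` is an assumed stand-in for wcwidth(3), not its actual behaviour on any particular libc.

```python
import unicodedata


def is_combining_diacritic(ch: str) -> bool:
    """Same range test as UNICODE_COMBINING_CHAR: 768..879,
    i.e. U+0300..U+036F (Combining Diacritical Marks)."""
    return 768 <= ord(ch) <= 879


def cell_width(ch: str, utf8_locale: bool = True) -> int:
    """Rough analogue of the WCWIDTH macro: force width 0 for the
    combining range in a UTF-8 locale, otherwise ask a width function."""
    if utf8_locale and is_combining_diacritic(ch):
        return 0
    # Assumed stand-in for wcwidth(3): Wide and Fullwidth characters
    # take two cells, everything else one.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
```

The range test only covers U+0300..U+036F, so a combining mark from another block (for example U+20D7) is not caught by it and falls through to the generic width function.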
Message-ID: <5346DE92.9020004@HIDDEN>
Date: Thu, 10 Apr 2014 14:10:26 -0400
From: Chet Ramey <chet.ramey@HIDDEN>
Reply-To: chet.ramey@HIDDEN
To: Steffen Nurpmeso <sdaoden@HIDDEN>, Rich Felker <dalias@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, chet.ramey@HIDDEN, Pádraig Brady <P@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Eric Blake <eblake@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>

On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:

> Even better would nonetheless be the great picture with
> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
> a tuple of {[EastAsianWidth indication,] is-combining,
> width-if-non-combining} and best even some composition function.

But we have always been at war with EastAsia!

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@HIDDEN    http://cnswww.cns.cwru.edu/~chet/

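The "extended xywidth(3)" quoted above is a proposal; no such function exists in POSIX. Purely to make the idea concrete, here is a hedged Python sketch of what such a query could return, built on the standard `unicodedata` module. The names `CharInfo`, `char_info`, `compose` and `display_width` are invented for this illustration. The width rule (two cells for Wide/Fullwidth, one otherwise) is a simplifying assumption, and a nonzero canonical combining class is used as an approximation of "is combining".

```python
import unicodedata
from typing import NamedTuple


class CharInfo(NamedTuple):
    east_asian_width: str  # Unicode East_Asian_Width value, e.g. "Na", "W", "A"
    is_combining: bool     # approximation: nonzero canonical combining class
    width: int             # cells occupied if non-combining, else 0


def char_info(ch: str) -> CharInfo:
    """Return the tuple sketched in the proposal for one code point."""
    eaw = unicodedata.east_asian_width(ch)
    combining = unicodedata.combining(ch) != 0
    width = 0 if combining else (2 if eaw in ("W", "F") else 1)
    return CharInfo(eaw, combining, width)


def compose(text: str) -> str:
    """A possible 'composition function': canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


def display_width(text: str) -> int:
    """Sum the cell widths of the composed form of text."""
    return sum(char_info(ch).width for ch in compose(text))
```

Under these assumptions `display_width` gives the same result for the precomposed "ü" (U+00FC) and for "u" followed by U+0308, which is the property a cell-counting format specifier would need.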
Message-ID: <20140410171624.an/caJUtgdHJiK1DmeoKZPSP@HIDDEN>
Date: Thu, 10 Apr 2014 18:16:24 +0200
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: Rich Felker <dalias@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Eric Blake <eblake@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Pádraig Brady <P@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <20140410075610.GO26358@HIDDEN>

Rich Felker <dalias@HIDDEN> wrote:
 |On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
 |> Eric Blake <eblake@HIDDEN> wrote:
 |>|Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |>|and currently states that %Ls is undefined.  But I would LOVE to have a
 |>|standardized spelling for counting characters instead of bytes.  The
 |>|extension %Ls looks like a good candidate for standardization, precisely
 |>|because counting characters when printing a multibyte string is more
 |>|useful than counting bytes (you do NOT want to end in the middle of a
 |>|multibyte character), and because ksh offers it as existing practice.
 |>|
 |>|Your idea for counting "cells" (by which I'm assuming you mean one or
 |>|more characters that all display within the same cell of the terminal,
 |>|as if the end user saw only one grapheme), on the other hand, does not
 |>|seem to have any precedence, and I would strongly object to having %s
 [.]
 |> I see you are trying to invent the word character for code points
 |> and reserve the term "graphem" for user-perceived characters.
 |> This goes in line with the GNU library which has the existing
 |> practice to let wcwidth(3) return the value 1 for accents and
 |> other combining code points as well as so-called (Unicode)
 |> noncharacters.  And who would call wcwidth(3) on something that is
 |> not to be drawn onto the screen directly afterwards.  And, of
 |> course, which terminal will perform the composition of code points
 |> written via STD I/O to characters on its own.
 |> I think for quite a while it is up to the input methods to combine
 |> into something precomposed in order to let POSIX programs finally
 |> work with it.
 |
 |Many languages do not have precomposed forms for all the character
 |sequences they need, and for some, it would not even be practical to
 |have precomposed forms, and would force the use of complex input
 |methods instead of simple keyboard maps.

And of course with UTF-8 decomposed forms of characters from an
immense number of languages can occur, at least in theory, in, e.g.,
a text file.  The German U+00FC (LATIN SMALL LETTER U WITH
DIAERESIS) could very well be «ü» but also U+0075 U+0308 «u ̈»,
dependent on where it came from.  And note that my vim(1) composed
U+00FC automatically when i tried to input the latter string; i had
to separate, enter each, and join them together to get at «u» plus,
actually non-, combining diaeresis.  (In fact actually «combining
with a space».)  Of course a wcwidth(3) of 1 for U+0308 is much
better than 0 when it really produces something visual.

Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.

I don't think that «user-perceived characters don't have any
precedence».  A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.  If POSIX beams itself onto UTF-8 it should
really consider to offer a way to be able to act on what the user
really deals with.  And that is, in the Unicode world -- and isn't
that what the bug report is about --, not necessarily
a mbrlen(3)-division of bytes.

--steffen

Original message content:

Message-ID: <20140410075610.GO26358@HIDDEN>
Date: Thu, 10 Apr 2014 03:56:10 -0400
From: Rich Felker <dalias@HIDDEN>
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Cc: Eric Blake <eblake@HIDDEN>, 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Pádraig Brady <P@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
In-Reply-To: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <eblake@HIDDEN> wrote:
> |>> Dan Douglas wrote:
> |>>> ksh93 already has this feature using the "L" modifier:
> |>>>
> |>>>     ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> |>>>     ★★★
> |>>
> |>> At least there is prior art for it.
> |>
> |> So we can count bytes, chars or cells (graphemes).
> |> > |> Thinking a bit more about it, I think shell level printf > |> should be dealing in text of the current encoding and counting cells. > |> In the edge case where you want to deal in bytes one can do: > |> LC_ALL=C printf ... > |> > |> I see that ksh behaves as I would expect and counts cells, > |> though requires the explicit %L enabler: > |> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'" > |> á★★ > |> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'" > |> A★ > |> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'" > |> A > |> > |> zsh seems to just count characters: > |> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'" > |> á★ > |> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'" > |> á★ > |> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'" > |> A★★ > |> > |> I see that dash gives invalid directive for any of %ls %Ls %S. > |> > |> Pity there is no consensus here. > |> Personally I would go for: > |> printf '%3s' 'blah' # count cells > |> printf '%3Ls' 'blah' # count chars > |> LANG=C '%3Ls' 'blah' # count bytes > |> LANG=C '%3s' 'blah' # count bytes > | > |Hmm. POSIX requires support for %ls (aka %S) according to byte counts, > |and currently states that %Ls is undefined. But I would LOVE to have a > |standardized spelling for counting characters instead of bytes. The > |extension %Ls looks like a good candidate for standardization, precisely > |because counting characters when printing a multibyte string is more > |useful than counting bytes (you do NOT want to end in the middle of a > |multibyte character), and because ksh offers it as existing practice. 
> | > |Your idea for counting "cells" (by which I'm assuming you mean one or > |more characters that all display within the same cell of the terminal, > |as if the end user saw only one grapheme), on the other hand, does not > |seem to have any precedence, and I would strongly object to having %s > |count by cells because %s already has a standardized (if unfortunate) > |meaning of counting by bytes. Maybe yet another extension is warranted > |(perhaps %LLs?) as a new notion for counting by cells instead of > |characters, but it's harder to justify that without existing practice. > > I see you are trying to invent the word character for code points > and reserve the term "graphem" for user-perceived characters. > This goes in line with the GNU library which has the existing > practice to let wcwidth(3) return the value 1 for accents and > other combining code points as well as so-called (Unicode) > noncharacters. And who would call wcwidth(3) on something that is > not to be drawn onto the screen directly afterwards. And, of > course, which terminal will perform the composition of code points > written via STD I/O to characters on its own. > I think for quite a while it is up to the input methods to combine > into something precomposed in order to let POSIX programs finally > work with it. Many languages do not have precomposed forms for all the character sequences they need, and for some, it would not even be practical to have precomposed forms, and would force the use of complex input methods instead of simple keyboard maps. Rich --=_01397146584=-hLlRJmGE22qxLp6/BPA3cw5+rq+yWU=_--
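The {EastAsianWidth, is-combining, width-if-non-combining} tuple and the composition function suggested in this message can be sketched in Python with the standard `unicodedata` module. This is only an illustration under assumptions: `xywidth` is the hypothetical name used in the message, not an existing libc function; `compose` is likewise a made-up helper; and treating only East Asian Wide/Fullwidth code points as two cells is a simplification.

```python
import unicodedata

def xywidth(ch):
    """Hypothetical xywidth(): (EastAsianWidth class, is-combining, width-if-non-combining)."""
    eaw = unicodedata.east_asian_width(ch)          # e.g. 'Na', 'A', 'W', 'F'
    is_combining = unicodedata.combining(ch) != 0   # non-zero canonical combining class
    width = 2 if eaw in ("W", "F") else 1           # assumption: only Wide/Fullwidth take two cells
    return eaw, is_combining, width

def compose(text):
    """Hypothetical composition helper: canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "u\u0308"                    # U+0075 followed by U+0308 COMBINING DIAERESIS
print(compose(decomposed) == "\u00fc")    # True: NFC yields the precomposed U+00FC
print(xywidth("\u0308")[1])               # True: U+0308 is a combining mark
print(xywidth("\uff21"))                  # ('F', False, 2): FULLWIDTH LATIN CAPITAL LETTER A
```

Returning the width a mark would have if it were drawn on its own, instead of a bare 0, lets the caller decide how to treat a combining mark that has no base character to attach to.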
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Rich Felker <dalias@HIDDEN>
To: Steffen Nurpmeso <sdaoden@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Eric Blake <eblake@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Pádraig Brady <P@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Thu, 10 Apr 2014 03:56:10 -0400
Message-ID: <20140410075610.GO26358@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN> <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN> <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
In-Reply-To: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <eblake@HIDDEN> wrote:
>  |>> Dan Douglas wrote:
>  |>>> ksh93 already has this feature using the "L" modifier:
>  |>>>
>  |>>>     ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>  |>>>     ★★★
>  |>>
>  |>> At least there is prior art for it.
>  |>
>  |> So we can count bytes, chars or cells (graphemes).
>  |>
>  |> Thinking a bit more about it, I think shell level printf
>  |> should be dealing in text of the current encoding and counting cells.
>  |> In the edge case where you want to deal in bytes one can do:
>  |>   LC_ALL=C printf ...
>  |>
>  |> I see that ksh behaves as I would expect and counts cells,
>  |> though requires the explicit %L enabler:
>  |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★★
>  |>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★
>  |>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>  |>   Ａ
>  |>
>  |> zsh seems to just count characters:
>  |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>  |>   Ａ★★
>  |>
>  |> I see that dash gives invalid directive for any of %ls %Ls %S.
>  |>
>  |> Pity there is no consensus here.
>  |> Personally I would go for:
>  |>   printf '%3s' 'blah'  # count cells
>  |>   printf '%3Ls' 'blah'  # count chars
>  |>   LANG=C printf '%3Ls' 'blah'  # count bytes
>  |>   LANG=C printf '%3s' 'blah'  # count bytes
>  |
>  |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
>  |and currently states that %Ls is undefined.  But I would LOVE to have a
>  |standardized spelling for counting characters instead of bytes.  The
>  |extension %Ls looks like a good candidate for standardization, precisely
>  |because counting characters when printing a multibyte string is more
>  |useful than counting bytes (you do NOT want to end in the middle of a
>  |multibyte character), and because ksh offers it as existing practice.
>  |
>  |Your idea for counting "cells" (by which I'm assuming you mean one or
>  |more characters that all display within the same cell of the terminal,
>  |as if the end user saw only one grapheme), on the other hand, does not
>  |seem to have any precedent, and I would strongly object to having %s
>  |count by cells because %s already has a standardized (if unfortunate)
>  |meaning of counting by bytes.  Maybe yet another extension is warranted
>  |(perhaps %LLs?) as a new notion for counting by cells instead of
>  |characters, but it's harder to justify that without existing practice.
>
> I see you are trying to invent the word character for code points
> and reserve the term "grapheme" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters.  And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards?  And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own?
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.

Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.

Rich
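The point about missing precomposed forms can be checked with Unicode normalization: NFC composes a sequence only when a precomposed code point exists. A minimal Python sketch follows; the sample strings and both helper names are chosen purely for illustration, and counting code points whose canonical combining class is zero is only a rough stand-in for real grapheme segmentation.

```python
import unicodedata

def nfc_length(text):
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))

def approx_user_chars(text):
    """Rough count of user-perceived characters: code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(nfc_length("u\u0308"))          # 1: composes to U+00FC
print(nfc_length("q\u0308"))          # 2: Unicode has no precomposed q with diaeresis
print(approx_user_chars("q\u0308"))   # 1: still one user-perceived character
```

A program that relies on input methods delivering precomposed text would therefore miscount the second string, however the text was typed.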
From: Steffen Nurpmeso <sdaoden@HIDDEN>
To: Eric Blake <eblake@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Pádraig Brady <P@HIDDEN>, Bob Proulx <bob@HIDDEN>, Austin Group <austin-group-l@HIDDEN>, Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Wed, 09 Apr 2014 14:49:37 +0200
Message-ID: <20140409134937.G1Sjvh4wKZUfnofJrM0R7RoW@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN> <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
In-Reply-To: <53431F2F.8060701@HIDDEN>
User-Agent: s-nail v14.6.4-1-ga39836e

Eric Blake <eblake@HIDDEN> wrote:
 |>> Dan Douglas wrote:
 |>>> ksh93 already has this feature using the "L" modifier:
 |>>>
 |>>>     ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
 |>>>     ★★★
 |>>
 |>> At least there is prior art for it.
 |>
 |> So we can count bytes, chars or cells (graphemes).
 |>
 |> Thinking a bit more about it, I think shell level printf
 |> should be dealing in text of the current encoding and counting cells.
 |> In the edge case where you want to deal in bytes one can do:
 |>   LC_ALL=C printf ...
 |>
 |> I see that ksh behaves as I would expect and counts cells,
 |> though requires the explicit %L enabler:
 |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★★
 |>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
 |>   Ａ★
 |>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
 |>   Ａ
 |>
 |> zsh seems to just count characters:
 |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★
 |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
 |>   á★
 |>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
 |>   Ａ★★
 |>
 |> I see that dash gives invalid directive for any of %ls %Ls %S.
 |>
 |> Pity there is no consensus here.
 |> Personally I would go for:
 |>   printf '%3s' 'blah'  # count cells
 |>   printf '%3Ls' 'blah'  # count chars
 |>   LANG=C printf '%3Ls' 'blah'  # count bytes
 |>   LANG=C printf '%3s' 'blah'  # count bytes
 |
 |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |and currently states that %Ls is undefined.  But I would LOVE to have a
 |standardized spelling for counting characters instead of bytes.  The
 |extension %Ls looks like a good candidate for standardization, precisely
 |because counting characters when printing a multibyte string is more
 |useful than counting bytes (you do NOT want to end in the middle of a
 |multibyte character), and because ksh offers it as existing practice.
 |
 |Your idea for counting "cells" (by which I'm assuming you mean one or
 |more characters that all display within the same cell of the terminal,
 |as if the end user saw only one grapheme), on the other hand, does not
 |seem to have any precedent, and I would strongly object to having %s
 |count by cells because %s already has a standardized (if unfortunate)
 |meaning of counting by bytes.  Maybe yet another extension is warranted
 |(perhaps %LLs?) as a new notion for counting by cells instead of
 |characters, but it's harder to justify that without existing practice.

I see you are trying to invent the word character for code points
and reserve the term "grapheme" for user-perceived characters.
This goes in line with the GNU library which has the existing
practice to let wcwidth(3) return the value 1 for accents and
other combining code points as well as so-called (Unicode)
noncharacters.  And who would call wcwidth(3) on something that is
not to be drawn onto the screen directly afterwards?  And, of
course, which terminal will perform the composition of code points
written via STD I/O to characters on its own?
I think for quite a while it is up to the input methods to combine
into something precomposed in order to let POSIX programs finally
work with it.

--steffen
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat, Inc.
To: Pádraig Brady <P@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 19:28:10 -0600
Message-ID: <534350AA.2050803@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN> <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN> <53433EA1.4010204@HIDDEN>
In-Reply-To: <53433EA1.4010204@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

On 04/07/2014 06:11 PM, Pádraig Brady wrote:

>
> If we had to make it explicit for backwards compat reasons,
> then I suppose counting by characters is the least useful,
> so we could just standardize the existing ksh behavior and have:
>
>   printf '%3s' 'blah'  # count bytes
>   printf '%3Ls' 'blah'  # count cells
>   LANG=C printf '%3Ls' 'blah'  # count bytes

If we add %3Ls to the shell, we should also add it to libc's printf(3),
which means coordinating with the C committee.

>
> This has the disadvantage of not degrading gracefully
> on dash for example where %Ls is rejected.

If a future version of the standard mandates behavior for %Ls, I suspect
dash would be made compliant fairly quickly - the dash maintainers strive
hard to comply with POSIX.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
From: Pádraig Brady <P@HIDDEN>
To: Eric Blake <eblake@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>, Bob Proulx <bob@HIDDEN>, Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Tue, 08 Apr 2014 01:11:13 +0100
Message-ID: <53433EA1.4010204@HIDDEN>
References: <53408EFF.7050601@HIDDEN> <53412952.1040506@HIDDEN> <20140406182447.GA1381@HIDDEN> <5342A337.9000407@HIDDEN> <53431F2F.8060701@HIDDEN>
In-Reply-To: <53431F2F.8060701@HIDDEN>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> >     ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> >     ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★
>>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>>   Ａ
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah'  # count cells
>>   printf '%3Ls' 'blah'  # count chars
>>   LANG=C printf '%3Ls' 'blah'  # count bytes
>>   LANG=C printf '%3s' 'blah'  # count bytes
>
> Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined.  But I would LOVE to have a
> standardized spelling for counting characters instead of bytes.  The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note ksh seems to count cells with %Ls.

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes.  Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority of uses
would prefer to be specifying cell counts. I thought there
might not be many backwards compat issues with doing that,
especially since zsh and gawk adjust the meaning of %s
according to the locale (albeit for char rather than cell count).
But it's a fair point that there may be scripts that
don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

  printf '%3s' 'blah'  # count bytes
  printf '%3Ls' 'blah'  # count cells
  LANG=C printf '%3Ls' 'blah'  # count bytes

This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.

thanks,
Pádraig.
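The three counting modes discussed in this thread (bytes, characters, cells) can be compared side by side. The following Python sketch only illustrates the distinction; it is not how ksh, zsh or coreutils implement precision handling. The function names are made up, and `cell_width` assumes that combining marks take zero cells and that only East Asian Wide/Fullwidth code points take two.

```python
import unicodedata

def cell_width(ch):
    """Assumed terminal width of one code point (simplified wcwidth-like rule)."""
    if unicodedata.combining(ch) != 0:
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def truncate_bytes(text, limit):
    """Keep at most `limit` UTF-8 bytes, dropping any character that was cut in half."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")

def truncate_chars(text, limit):
    """Keep at most `limit` code points."""
    return text[:limit]

def truncate_cells(text, limit):
    """Keep as many leading code points as fit into `limit` terminal cells."""
    out, used = [], 0
    for ch in text:
        w = cell_width(ch)
        if used + w > limit:
            break
        out.append(ch)
        used += w
    return "".join(out)

sample = "a\u0301\u2605\u2605\u2605"       # a, combining acute, three stars
print(truncate_cells(sample, 3))           # a + acute + two stars
print(truncate_chars(sample, 3))           # a + acute + one star
print(truncate_bytes(sample, 3))           # a + acute only
print(truncate_cells("\uff21\uff21\u2605", 3))  # a single fullwidth A
```

On the thread's sample strings, the cell and character variants give the same results as the ksh and zsh outputs quoted above, which suggests that the difference between the two shells is exactly the cells-versus-characters distinction.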
From: Eric Blake <eblake@HIDDEN>
To: Pádraig Brady <P@HIDDEN>, Bob Proulx <bob@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l@HIDDEN>, Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 15:57:03 -0600
Message-ID: <53431F2F.8060701@HIDDEN>
In-Reply-To: <5342A337.9000407@HIDDEN>

[adding the Austin Group]

On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>> Dan Douglas wrote:
>> > ksh93 already has this feature using the "L" modifier:
>> >
>> >   ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>> >   ★★★
>>
>> At least there is prior art for it.
>
> So we can count bytes, chars or cells (graphemes).
>
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
>   LC_ALL=C printf ...
>
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★★
>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>   Ａ★
>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>   Ａ
>
> zsh seems to just count characters:
>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>   Ａ★★
>
> I see that dash gives invalid directive for any of %ls %Ls %S.
>
> Pity there is no consensus here.
> Personally I would go for:
>   printf '%3s' 'blah'          # count cells
>   printf '%3Ls' 'blah'         # count chars
>   LANG=C printf '%3Ls' 'blah'  # count bytes
>   LANG=C printf '%3s' 'blah'   # count bytes

Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.

Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
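The remark above about ending in the middle of a multibyte character can be reproduced with a minimal sketch. It assumes an external `printf` reachable through `env` (for example the coreutils one) and forces the C locale so that the precision is counted in bytes; the variable names are only illustrative.

```sh
#!/bin/sh
# Sketch only: show that a byte-counted precision can split a UTF-8 character.
star=$(printf '\342\230\205')      # U+2605 BLACK STAR as UTF-8 bytes E2 98 85

# In the C locale, %.4s keeps 4 bytes: one whole star plus the lead byte
# of the second star, which is not a valid character on its own.
cut=$(env LC_ALL=C printf '%.4s' "$star$star")

n=$(printf '%s' "$cut" | wc -c)
echo "kept $((n)) bytes"           # arithmetic expansion strips padding from wc

# Dump the bytes to make the dangling lead byte visible.
printf '%s' "$cut" | od -An -tx1
```

A character-counted precision, as `%.3Ls` provides in ksh93 according to the thread, would stop after a whole character instead.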
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Jan Novak <jn@HIDDEN>
To: Pádraig Brady <P@HIDDEN>, Bob Proulx <bob@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 23:41:03 +0200
Message-ID: <53431B6F.1040108@HIDDEN>
In-Reply-To: <5342A337.9000407@HIDDEN>

Pádraig Brady wrote:
> Pity there is no consensus here.
> Personally I would go for:
>   printf '%3s' 'blah'          # count cells
>   printf '%3Ls' 'blah'         # count chars
>   LANG=C printf '%3Ls' 'blah'  # count bytes
>   LANG=C printf '%3s' 'blah'   # count bytes

I vote for it ... it is an excellent idea that the "standard" notation
works properly in a localized environment! (because this is exactly what
users expect)

Thanks!
novak
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Pádraig Brady <P@HIDDEN>
To: Bob Proulx <bob@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org, Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 14:08:07 +0100
Message-ID: <5342A337.9000407@HIDDEN>
In-Reply-To: <20140406182447.GA1381@HIDDEN>

On 04/06/2014 07:24 PM, Bob Proulx wrote:
> Pádraig Brady wrote:
>> Yes printf follows the C standard which only considers bytes.
>> ...
>> I don't think we'd be able to change the current operation of printf
>> due to backwards compat reasons? Though we might be able to somehow leverage
>> the existing multibyte character aware alignment/truncation code in:
>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>
> Dan Douglas pointed out in the corresponding discussion in bug-bash
> that ksh uses the L modifier.
>
> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>
> Dan Douglas wrote:
> > ksh93 already has this feature using the "L" modifier:
> >
> >   ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> >   ★★★
>
> At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).

Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
In the edge case where you want to deal in bytes one can do:
  LC_ALL=C printf ...

I see that ksh behaves as I would expect and counts cells,
though requires the explicit %L enabler:
  $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★★
  $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
  Ａ★
  $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
  Ａ

zsh seems to just count characters:
  $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
  Ａ★★

I see that dash gives invalid directive for any of %ls %Ls %S.

Pity there is no consensus here.
Personally I would go for:
  printf '%3s' 'blah'          # count cells
  printf '%3Ls' 'blah'         # count chars
  LANG=C printf '%3Ls' 'blah'  # count bytes
  LANG=C printf '%3s' 'blah'   # count bytes

Pádraig.
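To make the bytes versus characters distinction in the message above concrete, here is a small sketch that measures two of the test strings both ways. The helper names `byte_len` and `char_len` are hypothetical, the input is assumed to be valid UTF-8, and cells are deliberately not computed, since that needs display-width data such as `wcwidth()` provides.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not part of coreutils or any shell.

# Length in bytes.
byte_len() {
    n=$(printf '%s' "$1" | wc -c)
    echo $((n))                    # strips any padding wc may add
}

# Length in UTF-8 characters: delete continuation bytes (10xxxxxx,
# octal 200-277) and count what is left. Assumes valid UTF-8.
char_len() {
    n=$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c)
    echo $((n))
}

star=$(printf '\342\230\205')      # U+2605 BLACK STAR, 3 bytes
acute=$(printf '\314\201')         # U+0301 COMBINING ACUTE ACCENT, 2 bytes
wide_a=$(printf '\357\274\241')    # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A, 3 bytes

s1="a$acute$star$star$star"        # same text as $'a\u0301\u2605\u2605\u2605'
s2="$wide_a$star$star$star"        # fullwidth A followed by three stars

echo "s1: $(byte_len "$s1") bytes, $(char_len "$s1") chars"   # s1: 12 bytes, 5 chars
echo "s2: $(byte_len "$s2") bytes, $(char_len "$s2") chars"   # s2: 12 bytes, 4 chars
```

Neither number matches what ksh printed for `%.3Ls`: a combining accent adds a character but no cell, and a fullwidth letter is one character but takes two cells, which is consistent with the ksh output quoted in the message.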
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Bob Proulx <bob@HIDDEN>
To: 17196 <at> debbugs.gnu.org
Cc: Jan Novak <jn@HIDDEN>
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Sun, 6 Apr 2014 12:24:47 -0600
Message-ID: <20140406182447.GA1381@HIDDEN>
In-Reply-To: <53412952.1040506@HIDDEN>

Pádraig Brady wrote:
> Yes printf follows the C standard which only considers bytes.
> ...
> I don't think we'd be able to change the current operation of printf
> due to backwards compat reasons? Though we might be able to somehow leverage
> the existing multibyte character aware alignment/truncation code in:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.

http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html

Dan Douglas wrote:
> ksh93 already has this feature using the "L" modifier:
>
>   ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>   ★★★

At least there is prior art for it.

Bob
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Pádraig Brady <P@HIDDEN>
To: Jan Novak <jn@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Sun, 06 Apr 2014 19:13:21 +0100
Message-ID: <53419941.7090105@HIDDEN>
In-Reply-To: <53412952.1040506@HIDDEN>

On 04/06/2014 11:15 AM, Pádraig Brady wrote:
> On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>>
>> $ printf "|%3s|\n" "a"
>> |  a|
>>
>> $ printf "|%3s|\n" "á"      (char is a-acute)
>> | á|
>>
>> expected output:
>> |  á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
>
> Yes printf follows the C standard which only considers bytes.
> awk does respect characters in width specifiers though:
>
>   $ awk 'BEGIN{printf "|%3s|\n", "á"}'
>   |  á|

Jan points out to me that the awk solution is not portable
to mawk 1.3.3 at least. I used GNU Awk 3.1.8 above.

Pádraig.
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Pádraig Brady <P@HIDDEN>
To: Jan Novak <jn@HIDDEN>
Cc: 17196 <at> debbugs.gnu.org
Subject: Re: bug#17196: UTF-8 printf string formating problem
Date: Sun, 06 Apr 2014 11:15:46 +0100
Message-ID: <53412952.1040506@HIDDEN>
In-Reply-To: <53408EFF.7050601@HIDDEN>

On 04/06/2014 12:17 AM, Jan Novak wrote:
> Hello,
>
> printf string format counts bytes instead of chars, which leads to broken output ...
> (the same problem occurs with bash built in printf)
>
>
> just try this:
>
> $ echo $LANG
> us_US.UTF-8
>
>
> $ printf "|%3s|\n" "a"
> |  a|
>
> $ printf "|%3s|\n" "á"      (char is a-acute)
> | á|
>
> expected output:
> |  á|
>
> Is there some easy solution ?
>
> TIA for the answer

Yes printf follows the C standard which only considers bytes.
awk does respect characters in width specifiers though:

  $ awk 'BEGIN{printf "|%3s|\n", "á"}'
  |  á|

I don't think we'd be able to change the current operation of printf
due to backwards compat reasons? Though we might be able to somehow leverage
the existing multibyte character aware alignment/truncation code in:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

thanks,
Pádraig.
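Since printf pads by bytes and the awk approach depends on the awk implementation, one possible workaround is to compute the padding by hand. The sketch below is not from the thread: the helper names `utf8_len` and `pad_left` are hypothetical, the input is assumed to be valid UTF-8, and it counts characters, not terminal cells, so combining or double-width characters would still misalign.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not part of coreutils.

# Count UTF-8 characters by deleting continuation bytes (10xxxxxx,
# octal 200-277) and counting the remaining bytes.
utf8_len() {
    n=$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c)
    echo $((n))                    # strips any padding wc may add
}

# Right-align string $2 in a field of $1 characters,
# like printf '%Ns' but counting characters instead of bytes.
pad_left() {
    pad=$(( $1 - $(utf8_len "$2") ))
    if [ "$pad" -gt 0 ]; then
        printf "%${pad}s" ''       # emit the padding spaces
    fi
    printf '%s' "$2"
}

a_acute=$(printf '\303\241')       # "á" as UTF-8 bytes C3 A1

printf '|%s|\n' "$(pad_left 3 a)"            # |  a|
printf '|%s|\n' "$(pad_left 3 "$a_acute")"   # |  á|
```

Counting lead bytes keeps the helper independent of the current locale, at the cost of trusting the input encoding.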
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
From: Jan Novak <jn@HIDDEN>
To: bug-coreutils@HIDDEN
Subject: UTF-8 printf string formating problem
Date: Sun, 06 Apr 2014 01:17:19 +0200
Message-ID: <53408EFF.7050601@HIDDEN>

Hello,

printf string format counts bytes instead of chars, which leads to broken output ...
(the same problem occurs with bash built in printf)


just try this:

$ echo $LANG
us_US.UTF-8


$ printf "|%3s|\n" "a"
|  a|

$ printf "|%3s|\n" "á"      (char is a-acute)
| á|

expected output:
|  á|

Is there some easy solution ?

TIA for the answer

Best regards
Novak
Jan Novak <jn@HIDDEN>: bug-coreutils@HIDDEN.
bug-coreutils@HIDDEN: bug#17196; Package coreutils.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.