GNU bug report logs - #17196: multibyte: printf: %s counts bytes instead of characters
To reply to this bug, email your comments to 17196 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Sat, 05 Apr 2014 23:22:01 GMT)
Acknowledgement sent to Jan Novak <jn <at> turbo.sk>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 05 Apr 2014 23:22:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
printf string format counts bytes instead of chars, which leads to broken output ...
(the same problem occurs with the bash built-in printf)
Just try this:
$ echo $LANG
us_US.UTF-8
$ printf "|%3s|\n" "a"
|  a|
$ printf "|%3s|\n" "á" (char is a-acute)
| á|
Expected output:
|  á|
Is there some easy solution?
TIA for the answer
Best regards
Novak
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Sun, 06 Apr 2014 10:16:01 GMT)
Message #8 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 04/06/2014 12:17 AM, Jan Novak wrote:
> Hello,
>
> printf string format counts bytes instead of chars, which leads to broken output ...
> (the same problem occurs with bash built in printf)
>
>
> just try this:
>
> $ echo $LANG
> us_US.UTF-8
>
>
> $ printf "|%3s|\n" "a"
> |  a|
>
> $ printf "|%3s|\n" "á" (char is a-acute)
> | á|
>
> expected output:
> |  á|
>
> Is there some easy solution ?
>
> TIA for the answer
Yes, printf follows the C standard, which only considers bytes.
awk does respect characters in width specifiers, though:
$ awk 'BEGIN{printf "|%3s|\n", "á"}'
|  á|
I don't think we'd be able to change the current operation of printf
due to backwards compat reasons? Though we might be able to somehow leverage
the existing multibyte character aware alignment/truncation code in:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
thanks,
Pádraig.
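To make the byte/character distinction above concrete, here is a minimal sketch in Python (used only for illustration). The helper names pad_bytes and pad_chars are hypothetical and are not part of coreutils or awk; UTF-8 encoding is assumed. The first pads to a width counted in bytes, as described for printf, and the second pads to a width counted in characters, as described for gawk.

```python
def pad_bytes(s: str, width: int) -> str:
    """Right-align s in `width` columns, counting UTF-8 bytes."""
    n = len(s.encode("utf-8"))
    return " " * max(0, width - n) + s


def pad_chars(s: str, width: int) -> str:
    """Right-align s in `width` columns, counting characters (code points)."""
    return " " * max(0, width - len(s)) + s


# "á" (U+00E1) is one character but two bytes in UTF-8,
# so byte counting gives it one space of padding too few.
for text in ("a", "\u00e1"):
    print("|%s|  |%s|" % (pad_bytes(text, 3), pad_chars(text, 3)))
```

For the ASCII "a" both helpers agree; they only diverge once a character needs more than one byte.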
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Sun, 06 Apr 2014 18:14:02 GMT)
Message #11 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 04/06/2014 11:15 AM, Pádraig Brady wrote:
> On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>>
>> $ printf "|%3s|\n" "a"
>> |  a|
>>
>> $ printf "|%3s|\n" "á" (char is a-acute)
>> | á|
>>
>> expected output:
>> |  á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
>
> Yes printf follows the C standard which only considers bytes.
> awk does respect characters in width specifiers though:
>
> $ awk 'BEGIN{printf "|%3s|\n", "á"}'
> |  á|
Jan points out to me that the awk solution is not portable
to mawk 1.3.3 at least. I used GNU Awk 3.1.8 above.
Pádraig.
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Sun, 06 Apr 2014 18:25:02 GMT)
Message #14 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> Yes printf follows the C standard which only considers bytes.
> ...
> I don't think we'd be able to change the current operation of printf
> due to backwards compat reasons? Though we might be able to somehow leverage
> the existing multibyte character aware alignment/truncation code in:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.
http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
Dan Douglas wrote:
> ksh93 already has this feature using the "L" modifier:
>
> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> ★★★
At least there is prior art for it.
Bob
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Mon, 07 Apr 2014 13:09:02 GMT)
Message #17 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 04/06/2014 07:24 PM, Bob Proulx wrote:
> Pádraig Brady wrote:
>> Yes printf follows the C standard which only considers bytes.
>> ...
>> I don't think we'd be able to change the current operation of printf
>> due to backwards compat reasons? Though we might be able to somehow leverage
>> the existing multibyte character aware alignment/truncation code in:
>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>
> Dan Douglas pointed out in the corresponding discussion in bug-bash
> that ksh uses the L modifier.
>
> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>
> Dan Douglas wrote:
> > ksh93 already has this feature using the "L" modifier:
> >
> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> > ★★★
>
> At least there is prior art for it.
So we can count bytes, chars or cells (graphemes).
Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
In the edge case where you want to deal in bytes one can do:
LC_ALL=C printf ...
I see that ksh behaves as I would expect and counts cells,
though requires the explicit %L enabler:
$ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★★
$ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
A★
$ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
A
zsh seems to just count characters:
$ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
A★★
I see that dash gives invalid directive for any of %ls %Ls %S.
Pity there is no consensus here.
Personally I would go for:
printf '%3s' 'blah' # count cells
printf '%3Ls' 'blah' # count chars
LANG=C printf '%3Ls' 'blah' # count bytes
LANG=C printf '%3s' 'blah' # count bytes
Pádraig.
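The difference between counting characters and counting cells can be sketched as follows, again in Python for illustration only. The function names are hypothetical, and cell_width is a rough approximation built on the Unicode database (combining marks count 0, East Asian wide/fullwidth characters count 2, everything else, including ambiguous-width characters such as U+2605, counts 1). It is not wcwidth(3) and makes no claim to match what ksh or zsh do internally.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate number of terminal cells used by one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn on the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def truncate_chars(s: str, limit: int) -> str:
    """Keep at most `limit` characters (code points)."""
    return s[:limit]


def truncate_cells(s: str, limit: int) -> str:
    """Keep as many leading characters as fit into `limit` cells."""
    out, used = [], 0
    for ch in s:
        w = cell_width(ch)
        if used + w > limit:
            break
        out.append(ch)
        used += w
    return "".join(out)


sample = "a\u0301\u2605\u2605\u2605"  # a + combining acute + three stars
print(truncate_chars(sample, 3))      # a, the accent, and one star
print(truncate_cells(sample, 3))      # accented a and two stars
```

With a fullwidth letter (U+FF21, my own example) the two modes diverge the other way: three characters occupy more than three cells, so the cell-based cut keeps fewer characters than the character-based one.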
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Mon, 07 Apr 2014 21:42:02 GMT)
Message #20 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> Pity there is no consensus here.
> Personally I would go for:
> printf '%3s' 'blah' # count cells
> printf '%3Ls' 'blah' # count chars
> LANG=C '%3Ls' 'blah' # count bytes
> LANG=C '%3s' 'blah' # count bytes
I vote for it ...
It is an excellent idea that the "standard" notation works properly in a localized environment!
(because this is exactly what users expect)
Thanks !
novak
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Mon, 07 Apr 2014 21:58:02 GMT)
Message #23 received at 17196 <at> debbugs.gnu.org (full text, mbox):
[adding the Austin Group]
On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>> Dan Douglas wrote:
>> > ksh93 already has this feature using the "L" modifier:
>> >
>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>> > ★★★
>>
>> At least there is prior art for it.
>
> So we can count bytes, chars or cells (graphemes).
>
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
> LC_ALL=C printf ...
>
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> á★★
> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> A★
> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
> A
>
> zsh seems to just count characters:
> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> á★
> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
> á★
> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> A★★
>
> I see that dash gives invalid directive for any of %ls %Ls %S.
>
> Pity there is no consensus here.
> Personally I would go for:
> printf '%3s' 'blah' # count cells
> printf '%3Ls' 'blah' # count chars
> LANG=C '%3Ls' 'blah' # count bytes
> LANG=C '%3s' 'blah' # count bytes
Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.
Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
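The remark about not ending in the middle of a multibyte character can be illustrated with a short Python sketch. The function names are hypothetical and UTF-8 is assumed. The first helper cuts at a raw byte count and may leave a broken sequence; the second drops a trailing partial character so that the result is still valid text.

```python
def cut_bytes(s: str, n: int) -> bytes:
    """Truncate to n bytes, even if that splits a multibyte character."""
    return s.encode("utf-8")[:n]


def cut_bytes_whole_chars(s: str, n: int) -> str:
    """Truncate to at most n bytes, dropping a trailing partial character."""
    return s.encode("utf-8")[:n].decode("utf-8", errors="ignore")


stars = "\u2605\u2605"  # each star is three bytes in UTF-8
raw = cut_bytes(stars, 4)
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("byte cut left invalid UTF-8:", err.reason)
print(cut_bytes_whole_chars(stars, 4))
```

Cutting on character boundaries is what a character-counting precision would give for free, which is the usefulness argument made above.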
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Tue, 08 Apr 2014 00:12:01 GMT)
Message #26 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>> LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★★
>> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★
>> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>> A
>>
>> zsh seems to just count characters:
>> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>> printf '%3s' 'blah' # count cells
>> printf '%3Ls' 'blah' # count chars
>> LANG=C '%3Ls' 'blah' # count bytes
>> LANG=C '%3s' 'blah' # count bytes
>
> Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined. But I would LOVE to have a
> standardized spelling for counting characters instead of bytes. The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.
Note ksh seems to count cells with %Ls
> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedence, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes. Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.
At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).
But it's a fair point that there may be scripts
that don't consider the zsh behavior.
If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:
printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count cells
LANG=C printf '%3Ls' 'blah' # count bytes
This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
thanks,
Pádraig.
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Tue, 08 Apr 2014 01:29:01 GMT)
Message #29 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 04/07/2014 06:11 PM, Pádraig Brady wrote:
>
> If we had to make it explicit for backwards compat reasons,
> then I suppose counting by characters is the least useful,
> so we could just standardize the existing ksh behavior and have:
>
> printf '%3s' 'blah' # count bytes
> printf '%3Ls' 'blah' # count cells
> LANG=C '%3Ls' 'blah' # count bytes
If we add %3Ls to the shell, we should also add it to libc's printf(3),
which means coordinating with the C committee.
>
> This has the disadvantage of not degrading gracefully
> on dash for example where %Ls is rejected.
If a future version of the standard mandates behavior for %Ls, I suspect
dash would be made compliant fairly quickly - the dash maintainers
strive hard to comply with POSIX.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Wed, 09 Apr 2014 15:48:03 GMT)
Message #32 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Eric Blake <eblake <at> redhat.com> wrote:
|>> Dan Douglas wrote:
|>>> ksh93 already has this feature using the "L" modifier:
|>>>
|>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
|>>> ★★★
|>>
|>> At least there is prior art for it.
|>
|> So we can count bytes, chars or cells (graphemes).
|>
|> Thinking a bit more about it, I think shell level printf
|> should be dealing in text of the current encoding and counting cells.
|> In the edge case where you want to deal in bytes one can do:
|> LC_ALL=C printf ...
|>
|> I see that ksh behaves as I would expect and counts cells,
|> though requires the explicit %L enabler:
|> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★★
|> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
|> A★
|> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
|> A
|>
|> zsh seems to just count characters:
|> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
|> A★★
|>
|> I see that dash gives invalid directive for any of %ls %Ls %S.
|>
|> Pity there is no consensus here.
|> Personally I would go for:
|> printf '%3s' 'blah' # count cells
|> printf '%3Ls' 'blah' # count chars
|> LANG=C '%3Ls' 'blah' # count bytes
|> LANG=C '%3s' 'blah' # count bytes
|
|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|and currently states that %Ls is undefined. But I would LOVE to have a
|standardized spelling for counting characters instead of bytes. The
|extension %Ls looks like a good candidate for standardization, precisely
|because counting characters when printing a multibyte string is more
|useful than counting bytes (you do NOT want to end in the middle of a
|multibyte character), and because ksh offers it as existing practice.
|
|Your idea for counting "cells" (by which I'm assuming you mean one or
|more characters that all display within the same cell of the terminal,
|as if the end user saw only one grapheme), on the other hand, does not
|seem to have any precedence, and I would strongly object to having %s
|count by cells because %s already has a standardized (if unfortunate)
|meaning of counting by bytes. Maybe yet another extension is warranted
|(perhaps %LLs?) as a new notion for counting by cells instead of
|characters, but it's harder to justify that without existing practice.
I see you are trying to use the word "character" for code points
and reserve the term "grapheme" for user-perceived characters.
This is in line with the GNU library, which has the existing
practice of letting wcwidth(3) return the value 1 for accents and
other combining code points as well as so-called (Unicode)
noncharacters. And who would call wcwidth(3) on something that is
not to be drawn onto the screen directly afterwards? And, of
course, which terminal will perform the composition of code points
written via STD I/O to characters on its own?
I think for quite a while it will be up to the input methods to combine
into something precomposed in order to let POSIX programs finally
work with it.
--steffen
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Thu, 10 Apr 2014 07:57:02 GMT)
Message #35 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <eblake <at> redhat.com> wrote:
> |>> Dan Douglas wrote:
> |>>> ksh93 already has this feature using the "L" modifier:
> |>>>
> |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> |>>> ★★★
> |>>
> |>> At least there is prior art for it.
> |>
> |> So we can count bytes, chars or cells (graphemes).
> |>
> |> Thinking a bit more about it, I think shell level printf
> |> should be dealing in text of the current encoding and counting cells.
> |> In the edge case where you want to deal in bytes one can do:
> |> LC_ALL=C printf ...
> |>
> |> I see that ksh behaves as I would expect and counts cells,
> |> though requires the explicit %L enabler:
> |> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> |> á★★
> |> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> |> A★
> |> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
> |> A
> |>
> |> zsh seems to just count characters:
> |> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> |> á★
> |> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
> |> á★
> |> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> |> A★★
> |>
> |> I see that dash gives invalid directive for any of %ls %Ls %S.
> |>
> |> Pity there is no consensus here.
> |> Personally I would go for:
> |> printf '%3s' 'blah' # count cells
> |> printf '%3Ls' 'blah' # count chars
> |> LANG=C '%3Ls' 'blah' # count bytes
> |> LANG=C '%3s' 'blah' # count bytes
> |
> |Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> |and currently states that %Ls is undefined. But I would LOVE to have a
> |standardized spelling for counting characters instead of bytes. The
> |extension %Ls looks like a good candidate for standardization, precisely
> |because counting characters when printing a multibyte string is more
> |useful than counting bytes (you do NOT want to end in the middle of a
> |multibyte character), and because ksh offers it as existing practice.
> |
> |Your idea for counting "cells" (by which I'm assuming you mean one or
> |more characters that all display within the same cell of the terminal,
> |as if the end user saw only one grapheme), on the other hand, does not
> |seem to have any precedence, and I would strongly object to having %s
> |count by cells because %s already has a standardized (if unfortunate)
> |meaning of counting by bytes. Maybe yet another extension is warranted
> |(perhaps %LLs?) as a new notion for counting by cells instead of
> |characters, but it's harder to justify that without existing practice.
>
> I see you are trying to invent the word character for code points
> and reserve the term "graphem" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters. And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards. And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own.
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.
Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.
Rich
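A small Python sketch of the point about precomposed forms, using the standard unicodedata module (the helper name compose is just for illustration). NFC normalization folds a base letter plus combining mark into a single code point only when Unicode defines such a precomposed character. For other sequences the combining mark stays separate, so software still has to cope with it.

```python
import unicodedata


def compose(s: str) -> str:
    """Return the NFC (canonically composed) form of s."""
    return unicodedata.normalize("NFC", s)


# "a" + COMBINING ACUTE ACCENT has a precomposed form (U+00E1);
# "q" + COMBINING ACUTE ACCENT has none and stays two code points.
for base in ("a", "q"):
    decomposed = base + "\u0301"
    composed = compose(decomposed)
    print(base, len(decomposed), "->", len(composed))
```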
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Thu, 10 Apr 2014 16:17:04 GMT)
Message #38 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Rich Felker <dalias <at> aerifal.cx> wrote:
|On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
|> Eric Blake <eblake <at> redhat.com> wrote:
|>|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|>|and currently states that %Ls is undefined. But I would LOVE to have a
|>|standardized spelling for counting characters instead of bytes. The
|>|extension %Ls looks like a good candidate for standardization, precisely
|>|because counting characters when printing a multibyte string is more
|>|useful than counting bytes (you do NOT want to end in the middle of a
|>|multibyte character), and because ksh offers it as existing practice.
|>|
|>|Your idea for counting "cells" (by which I'm assuming you mean one or
|>|more characters that all display within the same cell of the terminal,
|>|as if the end user saw only one grapheme), on the other hand, does not
|>|seem to have any precedence, and I would strongly object to having %s
[.]
|> I see you are trying to invent the word character for code points
|> and reserve the term "graphem" for user-perceived characters.
|> This goes in line with the GNU library which has the existing
|> practice to let wcwidth(3) return the value 1 for accents and
|> other combining code points as well as so-called (Unicode)
|> noncharacters. And who would call wcwidth(3) on something that is
|> not to be drawn onto the screen directly afterwards. And, of
|> course, which terminal will perform the composition of code points
|> written via STD I/O to characters on its own.
|> I think for quite a while it is up to the input methods to combine
|> into something precomposed in order to let POSIX programs finally
|> work with it.
|
|Many languages do not have precomposed forms for all the character
|sequences they need, and for some, it would not even be practical to
|have precomposed forms, and would force the use of complex input
|methods instead of simple keyboard maps.
And of course with UTF-8, decomposed forms of characters from an
immense number of languages can occur, at least in theory, in,
e.g., a text file.
The German U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) could very
well be «ü» but also U+0075 U+0308 «u ̈», dependent on where it
came from. And note that my vim(1) composed U+00FC automatically
when i tried to input the latter string; i had to separate, enter
each, and join them together to get at «u» plus, actually non-,
combining diaeresis. (In fact actually «combining with a space».)
Of course a wcwidth(3) of 1 for U+0308 is much better than 0 when
it really produces something visual.
Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.
I don't think that «user-perceived characters don't have any
precedence». A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.
If POSIX beams itself onto UTF-8 it should really consider
offering a way to act on what the user really deals with.
And that is, in the Unicode world -- and isn't that what the bug
report is about --, not necessarily a mbrlen(3)-division of bytes.
--steffen
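As a rough illustration of the extended width query suggested in the message above, here is a hypothetical Python sketch. No xywidth(3) exists in any C library; the name, the tuple layout and the width rule (2 for East Asian wide/fullwidth, otherwise 1) are assumptions made for this example, built on Python's unicodedata module.

```python
import unicodedata
from typing import NamedTuple


class WidthInfo(NamedTuple):
    east_asian_width: str        # Unicode East_Asian_Width property value
    is_combining: bool           # True for combining marks
    width_if_non_combining: int  # cells to use when drawn on its own


def xywidth(ch: str) -> WidthInfo:
    """Hypothetical per-code-point query returning the tuple described above."""
    eaw = unicodedata.east_asian_width(ch)
    return WidthInfo(
        east_asian_width=eaw,
        is_combining=unicodedata.combining(ch) != 0,
        width_if_non_combining=2 if eaw in ("W", "F") else 1,
    )


for ch in ("u", "\u0308", "\uff21"):
    print("U+%04X" % ord(ch), xywidth(ch))
```

A caller could then decide for itself whether a combining mark takes a cell, depending on whether the output device composes it onto the previous character.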
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Thu, 10 Apr 2014 18:12:01 GMT)
Message #41 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
> Even better would nonetheless be the great picture with
> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
> a tuple of {[EastAsianWidth indication,] is-combining,
> width-if-non-combining} and best even some composition function.
But we have always been at war with EastAsia!
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU chet <at> case.edu http://cnswww.cns.cwru.edu/~chet/
Information forwarded to bug-coreutils <at> gnu.org: bug#17196; Package coreutils. (Fri, 11 Apr 2014 10:17:01 GMT)
Message #44 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Hello,
Chet Ramey <chet.ramey <at> case.edu> wrote:
|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
|
|> Even better would nonetheless be the great picture with
|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
|> a tuple of {[EastAsianWidth indication,] is-combining,
|> width-if-non-combining} and best even some composition function.
|
|But we have always been at war with EastAsia!
I see you really would love to get a hand from POSIX too:
?0[steffen <at> sherwood bash-4.3]$ grep -r UNICODE_COMB .
./lib/readline/display.c: if (t > 0 && UNICODE_COMBINING_CHAR (wc) && WCWIDTH (wc) == 0)
./lib/readline/rlmbutil.h:#define UNICODE_COMBINING_CHAR(x) ((x) >= 768 && (x) <= 879)
./lib/readline/rlmbutil.h:# define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : wcwidth(wc))
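For readers who do not read C macros, the range test in the grep output above can be mirrored in Python like this. It is a sketch: the function names are hypothetical, and only the numeric range 768..879 (U+0300..U+036F) is taken from the macro shown. The second helper uses the Unicode database instead, which also recognises combining marks outside that block.

```python
import unicodedata


def in_combining_block(ch: str) -> bool:
    """Same range test as the UNICODE_COMBINING_CHAR macro: 768..879."""
    return 768 <= ord(ch) <= 879


def ucd_is_combining(ch: str) -> bool:
    """True if the Unicode database gives ch a non-zero combining class."""
    return unicodedata.combining(ch) != 0


# U+0301 lies inside the block; U+0483 (COMBINING CYRILLIC TITLO) is a
# combining mark outside it; "a" is not combining at all.
for ch in ("\u0301", "\u0483", "a"):
    print("U+%04X" % ord(ch), in_combining_block(ch), ucd_is_combining(ch))
```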
And sorry for not making this clear for those who never dealt with
the problem (which is probably not uncommon for filesystem or
other kernel hackers): `EastAsianWidth' refers to a property of
Unicode and ISO 10646:
# EastAsianWidth-6.3.0.txt
# Date: 2013-02-05, 20:09:00 GMT [KW, LI]
#
# East Asian Width Properties
#
# This file is an informative contributory data file in the
# Unicode Character Database.
#
# Copyright (c) 1991-2013 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
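To make the tuple idea quoted above a bit more concrete, here is a rough
sketch of what such an extended width lookup might return. Everything in
it is an assumption for illustration: the struct and function names
(char_cell_info, lookup_cell_info) are invented, and the ranges are a
tiny hand-picked subset of the East Asian Width data. A real
implementation would presumably be generated from EastAsianWidth.txt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result tuple: {EastAsianWidth indication, is-combining,
   width-if-non-combining}. */
struct char_cell_info {
    bool east_asian_wide;   /* East_Asian_Width is W (wide) or F (fullwidth) */
    bool is_combining;      /* combining mark, takes no cell of its own */
    int  width;             /* columns if not combining: 1 or 2 */
};

/* Illustrative subset only, NOT the full Unicode table. */
static const struct { uint32_t first, last; } wide_ranges[] = {
    { 0x3041, 0x3096 },  /* Hiragana letters */
    { 0x4E00, 0x9FFF },  /* CJK Unified Ideographs block */
    { 0xAC00, 0xD7A3 },  /* Hangul Syllables */
    { 0xFF01, 0xFF60 },  /* Fullwidth ASCII variants */
};

struct char_cell_info lookup_cell_info(uint32_t cp)
{
    struct char_cell_info info = { false, false, 1 };

    assert(cp <= 0x10FFFF);               /* valid Unicode code point */
    if (cp >= 0x0300 && cp <= 0x036F) {   /* Combining Diacritical Marks */
        info.is_combining = true;         /* caller treats this as 0 columns */
        return info;
    }
    for (size_t i = 0; i < sizeof wide_ranges / sizeof wide_ranges[0]; i++) {
        if (cp >= wide_ranges[i].first && cp <= wide_ranges[i].last) {
            info.east_asian_wide = true;
            info.width = 2;
            break;
        }
    }
    return info;
}
```

A caller would add 0 columns when is_combining is set and width columns
otherwise, so U+4E2D counts as 2 and U+00E1 (a-acute) as 1.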
--steffen
...
To be honest i must admit i first was pissed, so let me append the
original first part of this message, please:
and so the landslide had brought it down.
But i would quote Paul Vixie, who stated in one of today's messages:
gentlemen and ladies, we have met the enemy, and they are our
egos.
vixie
From my point of view it is a matter of culture and philosophy
(including religion) how to deal with that very problem.
And i can assure you that Jehovah's Witnesses, who have been visiting
me regularly for some years now, like to drink a bit of my Buddhist
tea.
Paul Vixie is correct.
I am stupid.
With greetings from someone who will undergo his 42nd birthday soon
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#17196
; Package
coreutils
.
(Fri, 11 Apr 2014 12:27:02 GMT)
Message #47 received at 17196 <at> debbugs.gnu.org (full text, mbox):
On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
> Hello,
>
> Chet Ramey <chet.ramey <at> case.edu> wrote:
> |On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
> |
> |> Even better would nonetheless be the great picture with
> |> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
> |> a tuple of {[EastAsianWidth indication,] is-combining,
> |> width-if-non-combining} and best even some composition function.
> |
> |But we have always been at war with EastAsia!
>
> I see you really would love to get a hand from POSIX too:
I'm sorry, I realize that was rather obscure. It's from "1984", by George
Orwell. It's a central theme to the book. The quote was an attempt to
inject levity into the discussion.
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU chet <at> case.edu http://cnswww.cns.cwru.edu/~chet/
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#17196
; Package
coreutils
.
(Fri, 11 Apr 2014 13:42:01 GMT)
Message #50 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Chet Ramey <chet.ramey <at> case.edu> wrote:
|On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
|> Hello,
|>
|> Chet Ramey <chet.ramey <at> case.edu> wrote:
|>|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
|>|
|>|> Even better would nonetheless be the great picture with
|>|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
|>|> a tuple of {[EastAsianWidth indication,] is-combining,
|>|> width-if-non-combining} and best even some composition function.
|>|
|>|But we have always been at war with EastAsia!
|>
|> I see you really would love to get a hand from POSIX too:
|
|I'm sorry, I realize that was rather obscure. It's from "1984", by George
|Orwell. It's a central theme to the book. The quote was an attempt to
oh, ah, yes. So.. i got it right without getting it right.
Interestingly, a retrospective work on Walter Benjamin started
yesterday (<http://www.eingedenken.de/enter.html> --
"remembrance"): an artist (Christoph Korn) walked Benjamin's last
trip from Banyuls-sur-Mer (France) to Portbou (Spain; where Benjamin
committed suicide because it was impossible for him to reach the
U.S.), following a fixed time frame (monotonic tick, so to say)
after which he spoke a thesis of Benjamin (like, e.g., "There is no
document of civilization which is not at the same time a document
of barbarism."), followed by pausing and taking a (steady cam)
video of the most recent leg. Association with Paul Klee's "Angelus
Novus" is desired (from both parties).
|inject levity into the discussion.
That was easy.
--steffen
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#17196
; Package
coreutils
.
(Fri, 09 May 2014 02:17:02 GMT)
Message #53 received at 17196 <at> debbugs.gnu.org (full text, mbox):
Perhaps printf() needs some wide-character extensions via new % conversion characters.
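One way to picture what such an extension would have to do is to pad by
characters rather than bytes. The following is only a sketch under
stated assumptions, not a proposal for coreutils or the C library: it
assumes valid UTF-8 input, counts one column per code point (no
combining or East Asian wide handling), and the names utf8_char_count
and pad_left_chars are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string: every byte that is not a
   continuation byte (10xxxxxx) starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Right-align s in a field of `width` characters, like "%*s" but
   counting characters instead of bytes. Returns the number of bytes
   written (excluding the terminating NUL), or -1 if dst is too small. */
int pad_left_chars(char *dst, size_t dstsize, const char *s, size_t width)
{
    size_t chars = utf8_char_count(s);
    size_t pad = width > chars ? width - chars : 0;
    size_t len = strlen(s);

    if (pad + len + 1 > dstsize)
        return -1;
    memset(dst, ' ', pad);
    memcpy(dst + pad, s, len + 1);   /* copies the NUL as well */
    return (int)(pad + len);
}
```

With the two-byte UTF-8 sequence for a-acute,
pad_left_chars(buf, sizeof buf, "\xC3\xA1", 3) writes two spaces followed
by the character, whereas the byte-counting "%3s" from the original
report pads with only one space.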
Regards
Leslie
Mr. Leslie Satenstein
SENT FROM MY OPEN SOURCE LINUX SYSTEM.
>________________________________
> From: Pádraig Brady <P <at> draigBrady.com>
>To: Jan Novak <jn <at> turbo.sk>
>Cc: 17196 <at> debbugs.gnu.org
>Sent: Sunday, April 6, 2014 6:15 AM
>Subject: bug#17196: UTF-8 printf string formating problem
>
>
>On 04/06/2014 12:17 AM, Jan Novak wrote:
>> Hello,
>>
>> printf string format counts bytes instead of chars, which leads to broken output ...
>> (the same problem occurs with bash built in printf)
>>
>>
>> just try this:
>>
>> $ echo $LANG
>> us_US.UTF-8
>>
>>
>> $ printf "|%3s|\n" "a"
>> | a|
>>
>> $ printf "|%3s|\n" "á" (char is a-acute)
>> | á|
>>
>> expected output:
>> | á|
>>
>> Is there some easy solution ?
>>
>> TIA for the answer
>
>Yes printf follows the C standard which only considers bytes.
>awk does respect characters in width specifiers though:
>
> $ awk 'BEGIN{printf "|%3s|\n", "á"}'
> | á|
>
>I don't think we'd be able to change the current operation of printf
>due to backwards compat reasons? Though we might be able to somehow leverage
>the existing multibyte character aware alignment/truncation code in:
>http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>
>thanks,
>Pádraig.
Severity set to 'wishlist' from 'normal'
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Sat, 20 Oct 2018 03:20:01 GMT)
Changed bug title to 'multibyte: printf: %s counts bytes instead of characters' from 'UTF-8 printf string formating problem'
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Sat, 20 Oct 2018 03:20:01 GMT)
This bug report was last modified 6 years and 34 days ago.