GNU bug report logs - #69951
coreutils: printf formatting bug for nb_NO and nn_NO locales
Reported by: Thomas Dreibholz <dreibh <at> simula.no>
Date: Fri, 22 Mar 2024 22:11:01 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#69951; Package coreutils. (Fri, 22 Mar 2024 22:11:01 GMT)
Acknowledgement sent to Thomas Dreibholz <dreibh <at> simula.no>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 22 Mar 2024 22:11:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
I just discovered a printf bug, for at least the nb_NO and nn_NO locales,
when printing numbers with a thousands separator. To reproduce:
#!/bin/bash
for l in de_DE en_US nb_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" $n
   done
done
The expected output of "%'10d" is a right-aligned number string that is
10 characters wide.
The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8
and LC_NUMERIC=en_US.UTF-8:
LC_NUMERIC=de_DE.UTF-8
<         1>
<       100>
<     1.000>
<    10.000>
<   100.000>
< 1.000.000>
<10.000.000>
LC_NUMERIC=en_US.UTF-8
<         1>
<       100>
<     1,000>
<    10,000>
<   100,000>
< 1,000,000>
<10,000,000>
However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the
formatting is wrong:
LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>
I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04)
as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).
Under FreeBSD 14.0-RELEASE (coreutils-9.4_1), the output looks slightly
better but is still wrong:
LC_NUMERIC=nb_NO.UTF-8
< 1>
< 100>
< 1 000>
< 10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
< 1>
< 100>
< 1 000>
< 10 000>
< 100 000>
<1 000 000>
<10 000 000>
Maybe the issue is that the thousands separator for the Norwegian
locales is a space " ", while it is "." or "," for the German and US
English locales.
--
Best regards / Mit freundlichen Grüßen / Med vennlig hilsen
=======================================================================
Thomas Dreibholz
Simula Metropolitan Centre for Digital Engineering
Centre for Resilient Networks and Applications
Pilestredet 52
0167 Oslo, Norway
-----------------------------------------------------------------------
E-Mail:dreibh <at> simula.no
Homepage:http://simula.no/people/dreibh
=======================================================================
Information forwarded to bug-coreutils <at> gnu.org: bug#69951; Package coreutils. (Sat, 23 Mar 2024 14:41:02 GMT)
Message #8 received at 69951 <at> debbugs.gnu.org:
tag 69951 notabug
close 69951
stop
On 22/03/2024 20:22, Thomas Dreibholz wrote:
> [...]
> May be the issue is that the thousands separator for the Norwegian
> locales is a space " ", while it is "."/"," for German/US English locales.
The issue looks to be that the thousands separator for Norwegian locales
is "NARROW NO-BREAK SPACE" (U+202F), or more problematically the
_three_-byte UTF-8 sequence E2 80 AF. So it looks like the libc routines
are counting bytes rather than characters in this case.
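The difference between the two ways of counting can be checked without a Norwegian locale installed. The following is a minimal sketch, not taken from coreutils or glibc: the helper names utf8_bytes and utf8_chars are made up for illustration, and the character count assumes valid UTF-8 input, where every byte in the range 0x80-0xBF is a continuation byte.

```shell
#!/bin/bash
# Count bytes and UTF-8 code points in a string, independent of the locale.
# utf8_bytes and utf8_chars are hypothetical helper names.

# Number of bytes: wc -c always counts bytes.
utf8_bytes() {
   echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Number of code points: delete the UTF-8 continuation bytes
# (0x80-0xBF, octal 200-277) and count the bytes that are left.
utf8_chars() {
   echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

# "1", NARROW NO-BREAK SPACE (E2 80 AF, written in octal), "000"
s=$(printf '1\342\200\257000')

echo "bytes:      $(utf8_bytes "$s")"
echo "characters: $(utf8_chars "$s")"
```

For this string the sketch reports 7 bytes but 5 characters, so a 10-wide field padded by byte count gets 3 spaces where 5 would be needed.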
One suggestion is to do the alignment afterwards. For example:
$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'.f\n" $(seq -f '1E%.f' 7) | column --table-right=1 -t
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000
Actually I've just noticed that specifying the %'10.f format
does count characters and not bytes! So another solution is:
$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'10.f\n" $(seq -f '1E%.f' 7)
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000
The issue, if there is one, is in libc at least.
It would be worth checking existing glibc reports about this
and reporting if not mentioned.
cheers,
Pádraig.
Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 23 Mar 2024 14:56:01 GMT)
bug closed, send any further explanations to 69951 <at> debbugs.gnu.org and Thomas Dreibholz <dreibh <at> simula.no>. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 23 Mar 2024 14:56:02 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#69951; Package coreutils. (Sat, 23 Mar 2024 17:29:01 GMT)
Message #15 received at submit <at> debbugs.gnu.org:
Hi,
Some further debugging, using a hexdump of the printf output:
#!/bin/bash
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>" $n | hexdump -C
   done
done
The output is:
...
LC_NUMERIC=nb_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
printf seems to insert the 3-byte UTF-8 sequence 0xe2 0x80 0xaf as the
thousands separator. This sequence encodes NARROW NO-BREAK SPACE (U+202F) ->
https://www.fileformat.info/info/unicode/char/202f/index.htm . But
terminal output (tested with Konsole and XTerm) has fixed spacing, so
should the "narrow space" perhaps be a regular space or a regular
no-break space (0xc2 0xa0, HTML "&nbsp;") instead? Note also that
LibreOffice cannot render the output correctly with NARROW NO-BREAK
SPACE, even with proportional fonts, when the output of the test script
is loaded as a text file.
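The mapping from the bytes in the hexdump to a code point can be worked out with shell arithmetic. This is only an illustrative sketch: decode_utf8_2 and decode_utf8_3 are hypothetical helper names, and they assume well-formed two-byte and three-byte UTF-8 sequences without validating them.

```shell
#!/bin/bash
# Decode short UTF-8 sequences into their Unicode code points.
# Both helper names are hypothetical; no validation is done.

# Two-byte form: 110xxxxx 10xxxxxx
decode_utf8_2() {
   printf 'U+%04X\n' $(( (($1 & 0x1F) << 6) | ($2 & 0x3F) ))
}

# Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
decode_utf8_3() {
   printf 'U+%04X\n' $(( (($1 & 0x0F) << 12) | (($2 & 0x3F) << 6) | ($3 & 0x3F) ))
}

decode_utf8_3 0xE2 0x80 0xAF   # separator from the hexdump: U+202F
decode_utf8_2 0xC2 0xA0        # regular no-break space:     U+00A0
```

The payload bits of E2 80 AF are 0010, 000000 and 101111, which concatenate to 0x202F.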
Screenshots for illustration:
* Terminal output:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758462/+files/Screenshot_20240322_213947.png
* LibreOffice output:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758464/+files/Screenshot_20240322_222052.png
Information forwarded to bug-coreutils <at> gnu.org: bug#69951; Package coreutils. (Sat, 23 Mar 2024 18:26:02 GMT)
Message #18 received at submit <at> debbugs.gnu.org:
Hi,
indeed, the issue seems to be in libc. I can reproduce the problem with
a simple C program:
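#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char** argv)
{
   /* Use the locale from the environment (LC_ALL, LC_NUMERIC, ...). */
   setlocale(LC_ALL, "");
   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* Print each argument as a double and as an int, both with the
      grouping flag (') and a field width of 10. */
   for(int i = 1; i < argc; i++) {
      int    n = atoi(argv[i]);
      double f = atof(argv[i]);
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}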
Output with LC_NUMERIC=nb_NO.UTF-8:
Thousands Separator: < >
double <         1>	int <         1>
double <        10>	int <        10>
double <       100>	int <       100>
double <     1 000>	int <   1 000>
double <    10 000>	int <  10 000>
double <   100 000>	int < 100 000>
double < 1 000 000>	int <1 000 000>
double <10 000 000>	int <10 000 000>
So, for a float (%f), the output is as expected, while it is wrong for
an integer (%d).
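Until the integer case pads by characters, a script can format the number without a field width and do the padding itself. The following is a rough sketch under stated assumptions: pad_left is a hypothetical helper, it counts UTF-8 code points rather than display columns (so it would be wrong for combining or double-width characters), and it assumes valid UTF-8 input.

```shell
#!/bin/bash
# Right-align a string in a field of the given width, counting UTF-8
# code points instead of bytes. pad_left is a hypothetical helper name.
# The body runs in a subshell "( ... )" so its variables stay local.
pad_left() (
   width=$1
   text=$2
   # Code points = bytes that are not continuation bytes (0x80-0xBF).
   chars=$(( $(printf '%s' "$text" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
   pad=$(( width - chars ))
   if [ "$pad" -lt 0 ] ; then
      pad=0   # text is wider than the field: print it unpadded
   fi
   # %*s with an empty argument prints exactly $pad spaces.
   printf '%*s%s\n' "$pad" '' "$text"
)

# "1<U+202F>000" written with octal escapes; padded with 5 spaces.
pad_left 10 "$(printf '1\342\200\257000')"
```

With a Norwegian locale installed, it could be combined with the width-less format, for example pad_left 10 "$(LC_NUMERIC=nb_NO.UTF-8 /usr/bin/printf "%'d" 1000000)".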
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 21 Apr 2024 11:24:15 GMT)