GNU bug report logs - #69951
coreutils: printf formatting bug for nb_NO and nn_NO locales

Package: coreutils

Reported by: Thomas Dreibholz <dreibh <at> simula.no>

Date: Fri, 22 Mar 2024 22:11:01 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 69951 in the body.
You can then email your comments to 69951 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#69951; Package coreutils. (Fri, 22 Mar 2024 22:11:01 GMT)

Acknowledgement sent to Thomas Dreibholz <dreibh <at> simula.no>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 22 Mar 2024 22:11:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Thomas Dreibholz <dreibh <at> simula.no>
To: bug-coreutils <at> gnu.org
Subject: coreutils: printf formatting bug for nb_NO and nn_NO locales
Date: Fri, 22 Mar 2024 21:22:30 +0100
Hi,

I just discovered a printf bug for at least the nb_NO and nn_NO locales 
when printing numbers with thousands separator. To reproduce:

#!/bin/bash
for l in de_DE en_US nb_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" $n
   done
done

The expected output of "%'10d" is a right-aligned number string that is
10 characters wide.

The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8 
and LC_NUMERIC=en_US.UTF-8:

LC_NUMERIC=de_DE.UTF-8
<         1>
<       100>
<     1.000>
<    10.000>
<   100.000>
< 1.000.000>
<10.000.000>
LC_NUMERIC=en_US.UTF-8
<         1>
<       100>
<     1,000>
<    10,000>
<   100,000>
< 1,000,000>
<10,000,000>

However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the 
formatting is wrong:

LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<   1 000>
<  10 000>
< 100 000>
<1 000 000>
<10 000 000>

I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04) 
as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).

Under FreeBSD 14.0-RELEASE (coreutils-9.4_1), the output looks slightly 
better but is still wrong:

LC_NUMERIC=nb_NO.UTF-8
<         1>
<       100>
<    1 000>
<   10 000>
<  100 000>
<1 000 000>
<10 000 000>
LC_NUMERIC=nn_NO.UTF-8
<         1>
<       100>
<    1 000>
<   10 000>
<  100 000>
<1 000 000>
<10 000 000>

Maybe the issue is that the thousands separator for the Norwegian 
locales is a space " ", while it is "."/"," for German/US English locales.

-- 
Best regards / Mit freundlichen Grüßen / Med vennlig hilsen

=======================================================================
 Thomas Dreibholz

 Simula Metropolitan Centre for Digital Engineering
 Centre for Resilient Networks and Applications
 Pilestredet 52
 0167 Oslo, Norway
-----------------------------------------------------------------------
 E-Mail:dreibh <at> simula.no
 Homepage:http://simula.no/people/dreibh
=======================================================================

Information forwarded to bug-coreutils <at> gnu.org:
bug#69951; Package coreutils. (Sat, 23 Mar 2024 14:41:02 GMT)

Message #8 received at 69951 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Thomas Dreibholz <dreibh <at> simula.no>, 69951 <at> debbugs.gnu.org
Subject: Re: coreutils: printf formatting bug for nb_NO and nn_NO locales
Date: Sat, 23 Mar 2024 14:39:04 +0000
tag 69951 notabug
close 69951
stop

On 22/03/2024 20:22, Thomas Dreibholz wrote:
> Hi,
> 
> I just discovered a printf bug for at least the nb_NO and nn_NO locales
> when printing numbers with thousands separator. To reproduce:
> 
> #!/bin/bash
> for l in de_DE nb_NO ; do
>      echo "LC_NUMERIC=$l.UTF-8"
>      for n in 1 100 1000 10000 100000 1000000 10000000 ; do
>         LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>\n" $n
>      done
> done
> 
> The expected output of "%'10d" is a right-aligned number string that is
> 10 characters wide.
> 
> The output of the test script is fine for e.g. LC_NUMERIC=de_DE.UTF-8
> and LC_NUMERIC=en_US.UTF-8:
> 
> LC_NUMERIC=de_DE.UTF-8
> <         1>
> <       100>
> <     1.000>
> <    10.000>
> <   100.000>
> < 1.000.000>
> <10.000.000>

> However, for LC_NUMERIC=nb_NO.UTF-8 and LC_NUMERIC=nn_NO.UTF-8, the
> formatting is wrong:
> 
> LC_NUMERIC=nb_NO.UTF-8
> <         1>
> <       100>
> <   1 000>
> <  10 000>
> < 100 000>
> <1 000 000>
> <10 000 000>

> I reproduced the issue with coreutils-8.32-4.1ubuntu1.1 (Ubuntu 22.04)
> as well as coreutils-9.3-5.fc39.x86_64 (Fedora 39).
> 
> Under FreeBSD 14.0-RELEASE (coreutils-9.4_1), the output looks slightly
> better but is still wrong:
> 
> LC_NUMERIC=nb_NO.UTF-8
> <         1>
> <       100>
> <    1 000>
> <   10 000>
> <  100 000>
> <1 000 000>
> <10 000 000>
> LC_NUMERIC=nn_NO.UTF-8
> <         1>
> <       100>
> <    1 000>
> <   10 000>
> <  100 000>
> <1 000 000>
> <10 000 000>
> 
> Maybe the issue is that the thousands separator for the Norwegian
> locales is a space " ", while it is "."/"," for German/US English locales.

The issue looks to be that the thousands separator for Norwegian locales
is "NARROW NO-BREAK SPACE" (U+202F), or more problematically the _three_-byte
UTF-8 sequence E2 80 AF. So it looks like an issue with libc routines
counting bytes rather than characters in this case.

One suggestion is to do the alignment afterwards. For example:

$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'.f\n" $(seq -f '1E%.f' 7) | column --table-right=1 -t
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000

Actually I've just noticed that specifying the %'10.f format
does count characters and not bytes! So another solution is:

$ export LC_NUMERIC=nb_NO.UTF-8
$ printf "%'10.f\n" $(seq -f '1E%.f' 7)
        10
       100
     1 000
    10 000
   100 000
 1 000 000
10 000 000

The issue, if there is one, is in libc at least.
It would be worth checking existing glibc reports about this
and reporting it there if it is not already mentioned.

cheers,
Pádraig.




Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 23 Mar 2024 14:56:01 GMT)

bug closed, send any further explanations to 69951 <at> debbugs.gnu.org and Thomas Dreibholz <dreibh <at> simula.no> Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 23 Mar 2024 14:56:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#69951; Package coreutils. (Sat, 23 Mar 2024 17:29:01 GMT)

Message #15 received at submit <at> debbugs.gnu.org:

From: Thomas Dreibholz <dreibh <at> simula.no>
To: bug-coreutils <at> gnu.org
Subject: bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales
Date: Sat, 23 Mar 2024 12:39:59 +0100
Hi,

Some further debugging, using a hexdump of the printf output:

#!/bin/bash
for l in de_DE en_US nb_NO nn_NO ; do
   echo "LC_NUMERIC=$l.UTF-8"
   for n in 1 100 1000 10000 100000 1000000 10000000 ; do
      LC_NUMERIC=$l.UTF-8 /usr/bin/printf "<%'10d>" $n | hexdump -C
   done
done

The output is:

...
LC_NUMERIC=nb_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010
LC_NUMERIC=nn_NO.UTF-8
00000000  3c 20 20 20 20 20 20 20  20 20 31 3e              |<         1>|
0000000c
00000000  3c 20 20 20 20 20 20 20  31 30 30 3e              |<       100>|
0000000c
00000000  3c 20 20 20 31 e2 80 af  30 30 30 3e              |<   1...000>|
0000000c
00000000  3c 20 20 31 30 e2 80 af  30 30 30 3e              |<  10...000>|
0000000c
00000000  3c 20 31 30 30 e2 80 af  30 30 30 3e              |< 100...000>|
0000000c
00000000  3c 31 e2 80 af 30 30 30  e2 80 af 30 30 30 3e     |<1...000...000>|
0000000f
00000000  3c 31 30 e2 80 af 30 30  30 e2 80 af 30 30 30 3e  |<10...000...000>|
00000010

printf seems to insert the 3-byte UTF-8 sequence 0xe2 0x80 0xaf as the
thousands separator. This is the UTF-8 encoding of NARROW NO-BREAK SPACE
(U+202F):
https://www.fileformat.info/info/unicode/char/202f/index.htm
However, terminal output (tested with Konsole and XTerm) has fixed spacing,
so the "narrow space" should probably be a regular space or a regular
non-breaking space (0xc2 0xa0, HTML "&nbsp;")? Note also that LibreOffice
cannot produce correct screen output with NARROW NO-BREAK SPACE, even with
proportional fonts, when loading the output of the test script as a text
file.

Screenshots for illustration:

 * Terminal output:
   https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758462/+files/Screenshot_20240322_213947.png
 * LibreOffice output:
   https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/2058775/+attachment/5758464/+files/Screenshot_20240322_222052.png

Information forwarded to bug-coreutils <at> gnu.org:
bug#69951; Package coreutils. (Sat, 23 Mar 2024 18:26:02 GMT)

Message #18 received at submit <at> debbugs.gnu.org:

From: Thomas Dreibholz <dreibh <at> simula.no>
To: P <at> draigBrady.com
Cc: bug-coreutils <at> gnu.org
Subject: bug#69951: coreutils: printf formatting bug for nb_NO and nn_NO locales
Date: Sat, 23 Mar 2024 19:17:02 +0100
Hi,

indeed, the issue seems to be in libc. I can reproduce the problem with 
a simple C program:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char** argv)
{
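   /* Use the locale from the environment (LC_ALL, LC_NUMERIC, LANG). */
   setlocale (LC_ALL, "");

   struct lconv* loc = localeconv();
   printf("Thousands Separator: <%s>\n", loc->thousands_sep);

   /* Print each argument as a double and as an int, width 10, with grouping. */
   for(int i = 1; i < argc; i++) {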
      int    n = atoi(argv[i]);
      double f = atof(argv[i]);
      printf("double <%'10.0f>\tint <%'10d>\n", f, n);
   }
   return 0;
}

Output with LC_NUMERIC=nb_NO.UTF-8:

Thousands Separator: < >
double <         1>     int <         1>
double <        10>     int <        10>
double <       100>     int <       100>
double <     1 000>     int <   1 000>
double <    10 000>     int <  10 000>
double <   100 000>     int < 100 000>
double < 1 000 000>     int <1 000 000>
double <10 000 000>     int <10 000 000>

So, for a float (%f), the output is as expected, while it is wrong for 
an integer (%d).

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 21 Apr 2024 11:24:15 GMT)

This bug report was last modified 11 days ago.
