GNU bug report logs - #36887
coreutils-8.31: printf chokes on \u0041

Package: coreutils;

Reported by: Ulrich Mueller <ulm <at> gentoo.org>

Date: Thu, 1 Aug 2019 11:03:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Report forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Thu, 01 Aug 2019 11:03:01 GMT)

Acknowledgement sent to Ulrich Mueller <ulm <at> gentoo.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 01 Aug 2019 11:03:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ulrich Mueller <ulm <at> gentoo.org>
To: bug-coreutils <at> gnu.org
Cc: base-system <at> gentoo.org
Subject: coreutils-8.31: printf chokes on \u0041
Date: Thu, 01 Aug 2019 13:02:26 +0200
[Forwarding bug https://bugs.gentoo.org/680244 as requested by the
Gentoo package maintainer.]

According to printf(1):

   Interpreted sequences are:
   [...]
   
   \uHHHH Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)

   \UHHHHHHHH
          Unicode character with hex value HHHHHHHH (8 digits)

It does not work, though:

$ /usr/bin/printf '\u0041\n'
/usr/bin/printf: invalid universal character name \u0041
$ /usr/bin/printf '\U00000041\n'
/usr/bin/printf: invalid universal character name \U00000041

Other tools interpret the sequence correctly:

$ printf '\u0041\n'   # bash
A
$ echo -e '\u0041'    # bash
A
$ zsh -c "echo -e '\u0041'"
A
$ emacs -Q --batch --eval '(princ "\u0041\n")'
A
$ python -c "print ('\u0041')"
A
$ ruby -e 'print("\u0041\n")'
A




Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Thu, 01 Aug 2019 13:10:02 GMT)

Message #8 received at 36887 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Ulrich Mueller <ulm <at> gentoo.org>, 36887 <at> debbugs.gnu.org
Cc: base-system <at> gentoo.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Thu, 1 Aug 2019 14:09:08 +0100
On 01/08/19 12:02, Ulrich Mueller wrote:
> [Forwarding bug https://bugs.gentoo.org/680244 as requested by the
> Gentoo package maintainer.]
> 
> According to printf(1):
> 
>    Interpreted sequences are:
>    [...]
>    
>    \uHHHH Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
> 
>    \UHHHHHHHH
>           Unicode character with hex value HHHHHHHH (8 digits)
> 
> It does not work, though:
> 
> $ /usr/bin/printf '\u0041\n'
> /usr/bin/printf: invalid universal character name \u0041
> $ /usr/bin/printf '\U00000041\n'
> /usr/bin/printf: invalid universal character name \U00000041
> 
> Other tools interpret the sequence correctly:
> 
> $ printf '\u0041\n'   # bash
> A
> $ echo -e '\u0041'    # bash
> A
> $ zsh -c "echo -e '\u0041'"
> A
> $ emacs -Q --batch --eval '(princ "\u0041\n")'
> A
> $ python -c "print ('\u0041')"
> A
> $ ruby -e 'print("\u0041\n")'
> A

I agree this is a bit surprising.
The full manual states:

  "Unicode characters in the ranges
  U+0000...U+009F, U+D800...U+DFFF cannot be specified by this syntax,
  except for U+0024 ($), U+0040 (@), and U+0060 (`)."

This was previously discussed at:
https://lists.gnu.org/archive/html/bug-coreutils/2008-05/threads.html#00067
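
The rule quoted above can be sketched as a small POSIX shell check. This is only an illustration of the documented restriction, not the coreutils implementation; the function name ucn_allowed is hypothetical.

```shell
# Sketch of the restriction quoted from the manual, not the coreutils source.
# ucn_allowed HEX: succeed if the code point may be written as \uHHHH under
# the quoted rule, fail otherwise. The function name is hypothetical.
ucn_allowed() {
    cp=$((0x$1))
    # The three explicit exceptions: U+0024 ($), U+0040 (@), U+0060 (`).
    case $cp in
        36|64|96) return 0 ;;
    esac
    # U+0000...U+009F is excluded.
    if [ "$cp" -le 159 ]; then
        return 1
    fi
    # The surrogates U+D800...U+DFFF are excluded.
    if [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then
        return 1
    fi
    return 0
}

for hex in 0024 0041 00A0 D800 20AC; do
    if ucn_allowed "$hex"; then
        echo "U+$hex allowed"
    else
        echo "U+$hex rejected"
    fi
done
```

Under this rule U+0041 falls in the excluded range, which matches the "invalid universal character name" error in the report.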




Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Thu, 01 Aug 2019 20:19:02 GMT)

Message #11 received at 36887 <at> debbugs.gnu.org:

From: Ulrich Mueller <ulm <at> gentoo.org>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: base-system <at> gentoo.org, 36887 <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Thu, 01 Aug 2019 22:18:41 +0200
>>>>> On Thu, 01 Aug 2019, Pádraig Brady wrote:

> I agree this is a bit surprising.

Indeed, it most certainly violates the principle of least surprise.
In particular, it means that a shell script that runs in bash won't
run in a shell that doesn't have a built-in printf.

> The full manual states:

>   "Unicode characters in the ranges
>   U+0000...U+009F, U+D800...U+DFFF cannot be specified by this syntax,
>   except for U+0024 ($), U+0040 (@), and U+0060 (`)."

> This was previously discussed at:
> https://lists.gnu.org/archive/html/bug-coreutils/2008-05/threads.html#00067

So, there are reasons for this restriction in C99. However, I fail to
see how those reasons would apply to printf. Except for the surrogates
U+D800...U+DFFF, it looks like an arbitrary restriction, which only
makes the printf implementation incompatible with other GNU programs
(like Bash and Emacs).




Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Thu, 01 Aug 2019 23:38:02 GMT)

Message #14 received at 36887 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ulrich Mueller <ulm <at> gentoo.org>, Pádraig Brady
 <P <at> draigBrady.com>
Cc: base-system <at> gentoo.org, 36887 <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Thu, 1 Aug 2019 16:37:44 -0700
Ulrich Mueller wrote:
> Except for the surrogates
> U+D800...U+DFFF, it looks like an arbitrary restriction

It's not entirely arbitrary. Because of the restriction, coreutils printf 
doesn't have to worry about what this command should do:

  printf '\u0025d\n' 1 2

Does this print a single line "%d", or two lines "1" and "2"? There are good 
arguments either way, and one can easily construct even-stranger examples.




Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Fri, 02 Aug 2019 08:01:02 GMT)

Message #17 received at 36887 <at> debbugs.gnu.org:

From: Ulrich Mueller <ulm <at> gentoo.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: base-system <at> gentoo.org, Pádraig Brady <P <at> draigBrady.com>,
 36887 <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Fri, 02 Aug 2019 10:00:03 +0200
>>>>> On Fri, 02 Aug 2019, Paul Eggert wrote:

> It's not entirely arbitrary. Because of the restriction, coreutils
> printf doesn't have to worry about what this command should do:

>   printf '\u0025d\n' 1 2

Seems quite obvious, it should do the same as these commands:

  printf '\045d\n' 1 2
  printf '\x25d\n' 1 2

This is different from C behaviour, because printf(3) doesn't deal with
backslash escapes at all, which are interpreted earlier during parsing
of the string literal. That's why I think the C reasoning doesn't apply
here.




Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Fri, 02 Aug 2019 10:16:01 GMT)

Message #20 received at 36887 <at> debbugs.gnu.org:

From: L A Walsh <coreutils <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Ulrich Mueller <ulm <at> gentoo.org>, base-system <at> gentoo.org,
 Pádraig Brady <P <at> draigBrady.com>, 36887 <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Fri, 02 Aug 2019 03:14:51 -0700
On 2019/08/01 16:37, Paul Eggert wrote:
> Ulrich Mueller wrote:
>   
>> Except for the surrogates
>> U+D800...U+DFFF, it looks like an arbitrary restriction
>>     
>
> It's not entirely arbitrary. Because of the restriction, coreutils printf 
> doesn't have to worry about what this command should do:
>
>    printf '\u0025d\n' 1 2
>
> Does this print a single line "%d", or two lines "1" and "2"? There are good 
> arguments either way, and one can easily construct even-stranger examples.
>   
There are no format characters in the initial line, so only the 1st
argument is interpreted.  You can't do multiple interpretations, since if
you did there would be no stopping point (e.g. a hex-encode of a
hex-encode of '%d\n').









Information forwarded to bug-coreutils <at> gnu.org:
bug#36887; Package coreutils. (Wed, 07 Jun 2023 14:17:01 GMT)

Message #23 received at 36887 <at> debbugs.gnu.org:

From: Ulrich Mueller <ulm <at> gentoo.org>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 36887 <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Wed, 07 Jun 2023 16:16:28 +0200
Can this bug be closed? AFAICS it is fixed since coreutils-9.2.

Relevant commit:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/printf.c?id=0925e8a0f413ecf9004153d89b312b385b20d0ee
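
For scripts that may still meet an older external printf, a runtime probe is one possible workaround. The following is a minimal sketch under that assumption; the helper name has_u_escape is hypothetical, and /usr/bin/printf is simply the path used in the original report.

```shell
# has_u_escape CMD...: succeed if CMD, given the format '\u0041', prints "A".
# The function name is hypothetical; pass the printf you want to probe.
has_u_escape() {
    out=$("$@" '\u0041' 2>/dev/null) || return 1
    [ "$out" = "A" ]
}

# Probe the external printf from the report; the path is an assumption
# about the system layout.
if has_u_escape /usr/bin/printf; then
    printf '%s\n' 'external printf accepts \u0041'
else
    printf '%s\n' 'external printf rejects \u0041'
fi
```

The probe compares the actual output instead of checking a version number, so it does not depend on knowing which release carries the fix.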




Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Wed, 07 Jun 2023 14:58:01 GMT)

Notification sent to Ulrich Mueller <ulm <at> gentoo.org>:
bug acknowledged by developer. (Wed, 07 Jun 2023 14:58:02 GMT)

Message #28 received at 36887-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Ulrich Mueller <ulm <at> gentoo.org>
Cc: 36887-done <at> debbugs.gnu.org
Subject: Re: bug#36887: coreutils-8.31: printf chokes on \u0041
Date: Wed, 7 Jun 2023 15:57:01 +0100
On 07/06/2023 15:16, Ulrich Mueller wrote:
> Can this bug be closed? AFAICS it is fixed since coreutils-9.2.
> 
> Relevant commit:
> https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/printf.c?id=0925e8a0f413ecf9004153d89b312b385b20d0ee

Marked as done.

thanks!
Pádraig





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 06 Jul 2023 11:24:10 GMT)