GNU bug report logs - #73194
ls command converts utf-8 character into escape sequences


Package: coreutils;

Reported by: Simon Wolfe <sekaihenodoa <at> mutsuba.info>

Date: Thu, 12 Sep 2024 10:18:01 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 73194 in the body.
You can then email your comments to 73194 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#73194; Package coreutils. (Thu, 12 Sep 2024 10:18:02 GMT)

Acknowledgement sent to Simon Wolfe <sekaihenodoa <at> mutsuba.info>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 12 Sep 2024 10:18:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Simon Wolfe <sekaihenodoa <at> mutsuba.info>
To: bug-coreutils <at> gnu.org
Subject: ls command converts utf-8 character into escape sequences
Date: Thu, 12 Sep 2024 19:16:21 +0900
I have a file name that uses the Unicode character U+318DF, which is in the Tertiary Ideographic Plane, more precisely in CJK Unified Ideographs Extension H.

touch 𱣟
ls

returns:

''$'\360\261\243\237'

Extension H was introduced in Unicode 15.0 in 2022.

I also notice that this bug occurs with any character in Extension I (introduced in 2023).

Extension G seems to work okay.
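
As a sanity check on the report above, the four octal escapes printed by ls can be decoded by hand to confirm that they are the UTF-8 encoding of U+318DF. The following is a minimal sketch, assuming a POSIX shell with `od` and `tr` available; the variable names `bytes_hex` and `cp` are illustrative only.

```shell
# Dump the four octal-escaped bytes from the ls output as hex.
bytes_hex=$(printf '\360\261\243\237' | od -An -tx1 | tr -d ' \n')
echo "$bytes_hex"

# Decode the 4-byte UTF-8 sequence by hand:
# 3 payload bits from the lead byte, 6 from each continuation byte.
cp=$(( ((0xF0 & 0x07) << 18) | ((0xB1 & 0x3F) << 12) | ((0xA3 & 0x3F) << 6) | (0x9F & 0x3F) ))
printf 'U+%X\n' "$cp"
```

The second line of output should read U+318DF, which shows that ls is escaping the bytes of the file name unchanged rather than altering the name itself.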





Information forwarded to bug-coreutils <at> gnu.org:
bug#73194; Package coreutils. (Thu, 12 Sep 2024 10:37:01 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Thomas Wolff <towo <at> towo.net>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#73194: ls command converts utf-8 character into escape
 sequences
Date: Thu, 12 Sep 2024 12:36:05 +0200
Am 12.09.2024 um 12:16 schrieb Simon Wolfe:
> I have a file name that uses the Unicode character U+318DF, which is in
> the Tertiary Ideographic Plane, more precisely in CJK Unified Ideographs Extension H.
>
> touch 𱣟
> ls
>
> returns:
>
> ''$'\360\261\243\237'
I use a wrapper with my favourite options and a pipe to stop ls from
being witty about the terminal:
ls | cat

>
> Extension H was introduced in Unicode 15.0 in 2022.
>
> I also notice that this bug occurs with any character in Extension I
> (introduced in 2023).
>
> Extension G seems to work okay.





Information forwarded to bug-coreutils <at> gnu.org:
bug#73194; Package coreutils. (Thu, 12 Sep 2024 10:44:02 GMT)

Message #11 received at 73194 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Simon Wolfe <sekaihenodoa <at> mutsuba.info>, 73194 <at> debbugs.gnu.org
Subject: Re: bug#73194: ls command converts utf-8 character into escape
 sequences
Date: Thu, 12 Sep 2024 11:42:06 +0100
On 12/09/2024 11:16, Simon Wolfe wrote:
> I have a file name that uses the Unicode character U+318DF, which is in the Tertiary Ideographic Plane, more precisely in CJK Unified Ideographs Extension H.
> 
> touch 𱣟
> ls
> 
> returns:
> 
> ''$'\360\261\243\237'
> 
> Extension H was introduced in Unicode 15.0 in 2022.
> 
> I also notice that this bug occurs with any character in Extension I (introduced in 2023).
> 
> Extension G seems to work okay.

ls 9.4 works as expected for me with glibc-2.39 in a UTF-8 locale.
I.e. that file is displayed directly.
Now if I set the locale to a non-UTF-8 one, it will display the form above
(which works in all locales, BTW).

  $ touch ''$'\360\261\243\237'
  $ ls ''$'\360\261\243\237'
  𱣟
  $ LC_ALL=C ls ''$'\360\261\243\237'
  ''$'\360\261\243\237'

So I suspect your system libs are not updated to recognize this character,
hence the fallback format is used.

cheers,
Pádraig.
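
The locale dependence described in this reply can be reproduced in a scratch directory. The sketch below assumes GNU ls from a reasonably recent coreutils (one that supports `--quoting-style=shell-escape`) and a file system that accepts arbitrary byte sequences in names; the variable names `dir`, `name`, `quoted`, and `raw` are illustrative only.

```shell
# Work in a scratch directory so that only the test file is listed.
dir=$(mktemp -d)
name=$(printf '\360\261\243\237')   # UTF-8 bytes of U+318DF
touch "$dir/$name"

# Request the shell-escape style explicitly (GNU ls picks it by default on a
# terminal). In the C locale no multibyte character counts as printable,
# so the name should come back in the escaped fallback form.
quoted=$(LC_ALL=C ls --quoting-style=shell-escape "$dir")
echo "$quoted"

# Request the literal style and keep nongraphic bytes as-is:
# the name is emitted byte for byte, whatever the locale knows.
raw=$(LC_ALL=C ls --quoting-style=literal --show-control-chars "$dir")
echo "$raw"

rm -r "$dir"
```

Under these assumptions the first listing should print the escaped form seen in the report, and the second should print the raw bytes, which a UTF-8 terminal with a suitable font renders as the character itself.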





Information forwarded to bug-coreutils <at> gnu.org:
bug#73194; Package coreutils. (Thu, 12 Sep 2024 13:15:02 GMT)

Message #14 received at 73194 <at> debbugs.gnu.org:

From: Simon Wolfe <sekaihenodoa <at> mutsuba.info>
To: Pádraig Brady <P <at> draigBrady.com>, 73194 <at> debbugs.gnu.org
Subject: Re: bug#73194: ls command converts utf-8 character into escape
 sequences
Date: Thu, 12 Sep 2024 20:00:51 +0900
On 2024/09/12 19:42, Pádraig Brady wrote:
> On 12/09/2024 11:16, Simon Wolfe wrote:
>> I have a file name that uses the Unicode character U+318DF, which is in the Tertiary Ideographic Plane, more precisely in CJK Unified Ideographs Extension H.
>>
>> touch 𱣟
>> ls
>>
>> returns:
>>
>> ''$'\360\261\243\237'
>>
>> Extension H was introduced in Unicode 15.0 in 2022.
>>
>> I also notice that this bug occurs with any character in Extension I (introduced in 2023).
>>
>> Extension G seems to work okay.
> 
> ls 9.4 works as expected for me with glibc-2.39 in a UTF-8 locale.
> I.e. that file is displayed directly.
> Now if I set the locale to non UTF-8 it will display the form above
> (which works on all locales BTW).
> 
>    $ touch ''$'\360\261\243\237'
>    $ ls ''$'\360\261\243\237'
>    𱣟
>    $ LC_ALL=C ls ''$'\360\261\243\237'
>    ''$'\360\261\243\237'
> 
> So I suspect your system libs are not updated to recognize this character,
> hence the fallback format is used.
> 
> cheers,
> Pádraig.
> 
I am on a UTF-8 locale (ja_JP.utf8), though with glibc-2.35. I am not sure I can upgrade glibc without breaking dependencies.

Thanks for checking, anyway.







Information forwarded to bug-coreutils <at> gnu.org:
bug#73194; Package coreutils. (Fri, 13 Sep 2024 00:46:02 GMT)

Message #17 received at 73194 <at> debbugs.gnu.org:

From: Simon Wolfe <sekaihenodoa <at> mutsuba.info>
To: P <at> draigBrady.com, 73194 <at> debbugs.gnu.org
Subject: bug#73194: ls command converts utf-8 character into escape sequences
Date: Fri, 13 Sep 2024 09:45:35 +0900
How does ls version 9.4 handle code points that are not yet assigned?

I'm asking because it seems to take about 2 years for changes to make it into distros; it might be a good idea to code things ahead...

For example, if you use U+40500 (񀔀) and type

touch ''$'\361\200\224\200'
ls ''$'\361\200\224\200'

will it show 񀔀 or ''$'\361\200\224\200'?
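
The escape sequence used in this question can be derived from the code point with shell arithmetic. This is a minimal sketch, assuming a POSIX shell and a code point in the 4-byte UTF-8 range (U+10000 to U+10FFFF); the variable names are illustrative only. It only checks the encoding and says nothing about how any particular ls or glibc version treats the character.

```shell
cp=$((0x40500))

# Split the code point into the lead byte (3 payload bits)
# and three continuation bytes (6 payload bits each).
b1=$(( 0xF0 | (cp >> 18) ))
b2=$(( 0x80 | ((cp >> 12) & 0x3F) ))
b3=$(( 0x80 | ((cp >> 6) & 0x3F) ))
b4=$(( 0x80 | (cp & 0x3F) ))

# Render the bytes as backslash-octal escapes, as they appear inside $'...'.
escapes=$(printf '\\%o\\%o\\%o\\%o' "$b1" "$b2" "$b3" "$b4")
echo "$escapes"
```

The result should be \361\200\224\200, matching the bytes used in the touch and ls commands above.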




Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 16 Feb 2025 06:59:03 GMT)

bug closed, send any further explanations to 73194 <at> debbugs.gnu.org and Simon Wolfe <sekaihenodoa <at> mutsuba.info> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 16 Feb 2025 06:59:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 16 Mar 2025 11:24:32 GMT)

This bug report was last modified 116 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.