GNU bug report logs - #32236
df header corrupted with LANG=zh_TW.UTF-8 on macOS


Package: coreutils

Reported by: Chih-Hsuan Yen <yan12125 <at> gmail.com>

Date: Sat, 21 Jul 2018 16:10:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 32236 in the body.
You can then email your comments to 32236 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sat, 21 Jul 2018 16:10:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Chih-Hsuan Yen <yan12125 <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 21 Jul 2018 16:10:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sat, 21 Jul 2018 22:20:04 +0800
Hi coreutils developers,

I'm using coreutils on macOS High Sierra (10.13). I noticed that with
`LANG=zh_TW.UTF-8`, `df` output is corrupted.

�?�?系統 容�?? 已�?� �?��?� 已�?�% �??�?�?
/dev/disk1s1    234G  151G    81G    65% /
/dev/disk1s4    234G  2.1G    81G     3% /private/var/vm

(I'm not sure if other mail agents can display those characters
correctly or not. See my blog post [1] for the exact output.)

This seems similar to bug#25630 [2], which is still unresolved. My guess
is that iscntrl() is broken on macOS High Sierra, so in
hide_problematic_chars() some bytes of the Chinese header are replaced
with question marks. I managed to patch coreutils [3] to make `df` work.
Could you have a look? Thanks!

Best,

Chih-Hsuan Yen

[1] https://blog.chyen.cc/posts/2018/06/23/mac-df-chinese.html
[2] http://lists.gnu.org/archive/html/bug-coreutils/2017-02/msg00008.html
[3] https://github.com/yan12125/macports-ports/blob/fix-coreutils-df-chinese/sysutils/coreutils/files/patch-df.diff




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sat, 21 Jul 2018 20:31:02 GMT) Full text and rfc822 format available.

Message #8 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Chih-Hsuan Yen <yan12125 <at> gmail.com>, 32236 <at> debbugs.gnu.org,
 bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sat, 21 Jul 2018 13:30:25 -0700
Wow. That's surprising. I do see the FreeBSD man pages say:

"The 4.4BSD extension of accepting arguments outside of the range of the unsigned char type
in locales with large character sets is considered obsolete and may not be supported in
future releases."

Now I think that might have been referring to >= 0xFF, but fair enough.

I've attached a gnulib patch to document this, for iscntrl at least.
It would be great if someone could test the other is*() classification
functions on macOS so that I can write a more complete documentation patch.

I've also attached an alternative patch for df (in your name).
Can you try that one?

thanks!
Pádraig
[df-utf8-osx.patch (text/x-patch, attachment)]
[osx-iscntrl-doc.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sat, 21 Jul 2018 22:44:02 GMT) Full text and rfc822 format available.

Message #11 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: bug-gnulib <at> gnu.org
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>,
 Pádraig Brady <P <at> draigbrady.com>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 00:43:42 +0200
Hi Pádraig,

> I've attached a gnulib patch to document for iscntrl at least.

> +This function does not support arguments outside of the range of the
> +unsigned char type in locales with large character sets, on some platforms.
> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.

In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().

POSIX [1] says
  "The c argument is a type int, the value of which the application shall
   ensure is a character representable as an unsigned char or equal to the
   value of the macro EOF. If the argument has any other value, the behavior
   is undefined."

The term "character" is defined here [2]:
  "A sequence of one or more bytes representing a single graphic symbol or
   control code."

So, in a UTF-8 locale, a "character representable as an unsigned char"
is a byte sequence of length 1, where the single byte has a value in the
range 0x00..0x7F.

For invalid values "the behavior is undefined." You were expecting a value 0.

Now, in the gnulib documentations, what we mention as portability problems
are the cases where
  - the behaviour for valid arguments is different on different platforms, or
  - the boundary between valid and invalid arguments is fuzzy and depends on
    the platform.
IMO there's no point in documenting that a function _really_ has undefined
behaviour when POSIX says that it has undefined behaviour.

> I've also attached an alternative patch for df (in your name).

This patch is correct (because the characters that you test for in c_iscntrl
are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
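
A minimal sketch of the ASCII-only check described above. The names
`ascii_iscntrl` and `hide_ascii_controls` are hypothetical stand-ins, not
gnulib's c_iscntrl or df's actual code; the sketch only assumes the byte
ranges 0x00..0x1F and 0x7F named in this message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ASCII-only control test: true only for
   the bytes 0x00..0x1F and 0x7F, independent of the current locale.  */
bool
ascii_iscntrl (unsigned char c)
{
  return c <= 0x1F || c == 0x7F;
}

/* Replace ASCII control bytes in the NUL-terminated string S with '?'.
   Bytes >= 0x80 are never modified, so UTF-8 sequences stay intact.  */
void
hide_ascii_controls (char *s)
{
  assert (s != NULL);
  for (; *s; s++)
    if (ascii_iscntrl ((unsigned char) *s))
      *s = '?';
}
```

Since no locale-dependent function ever sees a byte >= 0x80, the result
does not depend on how a platform's iscntrl() treats such arguments.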

But it does not catch control characters outside of the ASCII range. It would
make sense to catch these as well. If you want to do that,
'hide_problematic_chars' needs to be rewritten as a loop that iterates across
the multibyte characters. For example with the 'mbiter' module, in
combination with the mb_iscntrl function from the 'mbchar' module. Or
directly with mbrtowc() and iswcntrl().
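
A rough sketch of the mbrtowc()/iswcntrl() variant, under stated
assumptions: the function name `replace_problematic_chars` is invented for
illustration, the caller is assumed to have selected the locale with
setlocale(), and undecodable bytes are replaced one at a time with '?'
(one possible policy, not necessarily what df should do):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Replace, in place, every control character and every undecodable
   byte in the NUL-terminated string CELL with '?'.  Decoding follows
   the current LC_CTYPE locale.  */
void
replace_problematic_chars (char *cell)
{
  assert (cell != NULL);
  char *src = cell;
  char *dst = cell;
  char *end = cell + strlen (cell);
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (src < end)
    {
      wchar_t wc;
      size_t avail = (size_t) (end - src);
      size_t n = mbrtowc (&wc, src, avail, &state);

      /* (size_t) -1 and (size_t) -2 both exceed AVAIL; 0 cannot occur
         before END but is guarded against so the loop always advances.  */
      bool decoded = n != 0 && n <= avail;
      if (!decoded)
        {
          n = 1;
          memset (&state, 0, sizeof state);
        }

      if (decoded && !iswcntrl ((wint_t) wc))
        {
          /* DST never runs ahead of SRC, but the two ranges may
             overlap, so memmove is used rather than memcpy.  */
          memmove (dst, src, n);
          dst += n;
        }
      else
        *dst++ = '?';

      src += n;
    }
  *dst = '\0';
}
```

Which non-ASCII characters count as controls here depends on the
platform's iswcntrl() and the active locale.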

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
[2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 06:44:02 GMT) Full text and rfc822 format available.

Message #14 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Pádraig Brady <P <at> draigbrady.com>, bug-gnulib <at> gnu.org,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 14:07:25 +0800
The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
if you want me to test other patches, thanks!

Cheers,

Chih-Hsuan Yen




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 10:48:02 GMT) Full text and rfc822 format available.

Message #17 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Chih-Hsuan Yen <yan12125 <at> gmail.com>
Cc: Pádraig Brady <P <at> draigbrady.com>, bug-gnulib <at> gnu.org,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 12:46:59 +0200
Chih-Hsuan Yen wrote:
> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
> if you want me to test other patches, thanks!

You could test how it behaves with mount points that contain U+2028 or
U+2029 characters. On Linux, I'd test it like this. Hope it's similar
on macOS:
$ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop0                1986048    1986048          0  100% /tmp/abc�def�ghi
$ ls -ld /tmp/abc*
dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi

Bruno





Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 14:08:01 GMT) Full text and rfc822 format available.

Message #20 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Pádraig Brady <P <at> draigbrady.com>, bug-gnulib <at> gnu.org,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 20:51:16 +0800
2018-07-22 18:46 GMT+08:00 Bruno Haible <bruno <at> clisp.org>:
> Chih-Hsuan Yen wrote:
>> The `c_iscntrl()` patch also fixes the issue on macOS. Please tell me
>> if you want me to test other patches, thanks!
>
> You could test how it behaves with mount points that contain U+2028 or
> U+2029 characters. On Linux, I'd test it like this. Hope it's similar
> on macOS:
> $ mkdir /tmp/`printf 'abc\u2028def\u2029ghi'`
> $ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
> $ df
> ...
> /dev/loop0                1986048    1986048          0  100% /tmp/abc�def�ghi
> $ ls -ld /tmp/abc*
> dr-xr-xr-x 4 root root 2048 Nov 19  2014 /tmp/abc?def?ghi
>
> Bruno
>

Hi Bruno,

With the c_iscntrl() patch, the results of ls and df are as follows. (I use
xxd because Gmail seems unable to handle U+2028 and U+2029 correctly.)

$ ls -ld /tmp/abc<U+2028>def<U+2029>ghi | xxd
00000000: 6472 7778 722d 7872 2d78 2031 2079 656e  drwxr-xr-x 1 yen
00000010: 2073 7461 6666 2034 3039 3620 3230 3138   staff 4096 2018
00000020: 2f30 372f 3232 2030 323a 3136 3a34 3320  /07/22 02:16:43
00000030: 2f74 6d70 2f61 6263 e280 a864 6566 e280  /tmp/abc...def..
00000040: a967 6869 0a                             .ghi.

$ df | xxd
00000000: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  ............
00000010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7      ......  ....
00000020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  ...... .....
00000030: a825 20e6 8e9b e8bc 89e9 bb9e 0a2f 6465  .% ........../de
00000040: 762f 6469 736b 3173 3120 2020 2032 3334  v/disk1s1    234
00000050: 4720 2031 3337 4720 2020 3935 4720 2020  G  137G   95G
00000060: 3630 2520 2f0a 2f64 6576 2f64 6973 6b31  60% /./dev/disk1
00000070: 7334 2020 2020 3233 3447 2020 322e 3147  s4    234G  2.1G
00000080: 2020 2039 3547 2020 2020 3325 202f 7072     95G    3% /pr
00000090: 6976 6174 652f 7661 722f 766d 0a63 6879  ivate/var/vm.chy
000000a0: 656e 2e63 633a 2020 2020 2020 2020 3235  en.cc:        25
000000b0: 4720 2020 3132 4720 2020 3132 4720 2020  G   12G   12G
000000c0: 3532 2520 2f70 7269 7661 7465 2f74 6d70  52% /private/tmp
000000d0: 2f61 6263 e280 a864 6566 e280 a967 6869  /abc...def...ghi
000000e0: 0a                                       .

Without the c_iscntrl() patch (unmodified 8.30), ls behaves the same,
and the result of df is:

$ df | xxd
00000000: e6aa 3fe6 a13f e7b3 bbe7 b5b1 20e5 aeb9  ..?..?...... ...
00000010: e93f 3f20 e5b7 b2e7 3fa8 20e5 3faf e73f  .?? ....?. .?..?
00000020: a820 e5b7 b2e7 3fa8 2520 e63f 3fe8 bc3f  . ....?.% .??..?
00000030: e9bb 3f0a 2f64 6576 2f64 6973 6b31 7331  ..?./dev/disk1s1
00000040: 2020 2020 3233 3447 2020 3133 3747 2020      234G  137G
00000050: 2020 3935 4720 2020 2036 3025 202f 0a2f    95G    60% /./
00000060: 6465 762f 6469 736b 3173 3420 2020 2032  dev/disk1s4    2
00000070: 3334 4720 2032 2e31 4720 2020 2039 3547  34G  2.1G    95G
00000080: 2020 2020 2033 2520 2f70 7269 7661 7465       3% /private
00000090: 2f76 6172 2f76 6d0a 6368 7965 6e2e 6363  /var/vm.chyen.cc
000000a0: 3a20 2020 2020 2020 2032 3547 2020 2031  :        25G   1
000000b0: 3247 2020 2020 3132 4720 2020 2035 3125  2G    12G    51%
000000c0: 202f 7072 6976 6174 652f 746d 702f 6162   /private/tmp/ab
000000d0: 63e2 3fa8 6465 66e2 3fa9 6768 690a       c.?.def.?.ghi.
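
Comparing the two dumps, every byte that turned into 3f lies in the range
0x80..0x9F, while bytes from 0xA0 upward are left alone. The sketch below
models that observation; `modeled_iscntrl` is a hypothetical predicate
inferred from the dumps, not the platform's actual iscntrl():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model inferred from the dumps: ASCII controls plus every
   byte in 0x80..0x9F are classified as "control".  */
bool
modeled_iscntrl (unsigned char c)
{
  return c <= 0x1F || c == 0x7F || (0x80 <= c && c <= 0x9F);
}

/* Byte-wise filter: replace each byte of BUF[0..LEN-1] that the model
   classifies as a control with '?'.  */
void
hide_bytes (unsigned char *buf, size_t len)
{
  assert (buf != NULL);
  for (size_t i = 0; i < len; i++)
    if (modeled_iscntrl (buf[i]))
      buf[i] = '?';
}
```

Applied to the first header bytes e6 aa 94 e6 a1 88, this model gives
e6 aa 3f e6 a1 3f, matching the start of the unpatched dump.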

Hope those results are helpful!

Chih-Hsuan Yen




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 15:13:02 GMT) Full text and rfc822 format available.

Message #23 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Chih-Hsuan Yen <yan12125 <at> gmail.com>, 32236 <at> debbugs.gnu.org,
 bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 08:12:04 -0700
Pádraig Brady wrote:
> I've also attached an alternative patch for df (in your name).

That still has problems, since it can generate improperly-encoded strings in 
UTF-8 locales (if the inputs are improperly encoded), and can replace parts of 
multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
patch instead, which attempts to address these issues. This is more along the 
lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
simpler overall just to use mbrtowc directly for this one thing.
[0001-df-avoid-multibyte-character-corruption-on-macOS.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 16:10:02 GMT) Full text and rfc822 format available.

Message #26 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-gnulib <bug-gnulib <at> gnu.org>,
 Pádraig Brady <P <at> draigbrady.com>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Mon, 23 Jul 2018 00:09:45 +0800
2018-07-22 23:12 GMT+08:00 Paul Eggert <eggert <at> cs.ucla.edu>:
> Pádraig Brady wrote:
>>
>> I've also attached an alternative patch for df (in your name).
>
>
> That still has problems, since it can generate improperly-encoded strings in
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts
> of multibyte characters with '?' in non-UTF-8 locales. Please try the
> attached patch instead, which attempts to address these issues. This is more
> along the lines that Bruno suggested, except it doesn't use mbsiter as I
> figured it was simpler overall just to use mbrtowc directly for this one
> thing.

Here's the result of df:

$ df
檔案系統        容量  已用  可用 已用 掛載點
/dev/disk1s1    234G  137G   95G  60% /
/dev/disk1s4    234G  2.1G   95G   3% /private/var/vm
chyen.cc:        25G   12G   12G  51% /private/tmp/abc def ghi

$ df | xxd
00000000: e6aa 94e6 a188 e7b3 bbe7 b5b1 2020 2020  ............
00000010: 2020 2020 e5ae b9e9 878f 2020 e5b7 b2e7      ......  ....
00000020: 94a8 2020 e58f afe7 94a8 20e5 b7b2 e794  ..  ...... .....
00000030: a820 e68e 9be8 bc89 e9bb 9e0a 2f64 6576  . ........../dev
00000040: 2f64 6973 6b31 7331 2020 2020 3233 3447  /disk1s1    234G
00000050: 2020 3133 3747 2020 2039 3547 2020 3630    137G   95G  60
00000060: 2520 2f0a 2f64 6576 2f64 6973 6b31 7334  % /./dev/disk1s4
00000070: 2020 2020 3233 3447 2020 322e 3147 2020      234G  2.1G
00000080: 2039 3547 2020 2033 2520 2f70 7269 7661   95G   3% /priva
00000090: 7465 2f76 6172 2f76 6d0a 6368 7965 6e2e  te/var/vm.chyen.
000000a0: 6363 3a20 2020 2020 2020 2032 3547 2020  cc:        25G
000000b0: 2031 3247 2020 2031 3247 2020 3531 2520   12G   12G  51%
000000c0: 2f70 7269 7661 7465 2f74 6d70 2f61 6263  /private/tmp/abc
000000d0: e280 a864 6566 e280 a967 6869 0a         ...def...ghi.

Chinese header names are correct, and U+2028 and U+2029 are written
as-is. All tested with LANG=zh_TW.UTF-8 LC_COLLATE=C
LC_CTYPE=zh_TW.UTF-8.




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 16:18:02 GMT) Full text and rfc822 format available.

Message #29 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Chih-Hsuan Yen <yan12125 <at> gmail.com>,
 32236 <at> debbugs.gnu.org, bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 09:17:07 -0700
On 22/07/18 08:12, Paul Eggert wrote:
> Pádraig Brady wrote:
>> I've also attached an alternative patch for df (in your name).
> 
> That still has problems, since it can generate improperly-encoded strings in 
> UTF-8 locales (if the inputs are improperly encoded), and can replace parts of 
> multibyte characters with '?' in non-UTF-8 locales. Please try the attached 
> patch instead, which attempts to address these issues. This is more along the 
> lines that Bruno suggested, except it doesn't use mbsiter as I figured it was 
> simpler overall just to use mbrtowc directly for this one thing.

I haven't time to review this now, but I did want to avoid only \n etc.,
which might cause issues for programs that parse df output line by line.
This subset of control characters is safe to identify.
It seems problematic to start eliding improperly encoded
mount points, for example, rather than just outputting
what's there.

Also just incrementing width++ per each wide character
doesn't seem right, though again I've not tested it.
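
To illustrate the width concern, here is a sketch that sums wcwidth()
per character instead of adding one per wide character. The function name
`display_width` and the policy of counting undecodable or nonprintable
items as one column are assumptions made for this example:

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() on glibc */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Return the number of terminal columns that the NUL-terminated
   multibyte string S is expected to occupy in the current locale.
   Undecodable bytes and nonprintable characters count as one column,
   on the assumption that each would be shown as a single '?'.  */
int
display_width (const char *s)
{
  assert (s != NULL);
  const char *end = s + strlen (s);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  int width = 0;

  while (s < end)
    {
      wchar_t wc;
      size_t avail = (size_t) (end - s);
      size_t n = mbrtowc (&wc, s, avail, &state);
      int w = 1;

      if (n == 0 || n > avail)
        {
          /* Invalid or truncated sequence: consume one byte.  */
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else
        {
          w = wcwidth (wc);
          if (w < 0)
            w = 1;
        }

      width += w;
      s += n;
    }
  return width;
}
```

In a UTF-8 locale a CJK ideograph typically reports a width of 2, so a
per-character increment of 1 would misalign columns such as the header
in this report.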

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 16:26:02 GMT) Full text and rfc822 format available.

Message #32 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>, bug-gnulib <at> gnu.org
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 09:25:18 -0700
On 21/07/18 15:43, Bruno Haible wrote:
> Hi Pádraig,
> 
>> I've attached a gnulib patch to document for iscntrl at least.
> 
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
> 
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
> 
> POSIX [1] says
>   "The c argument is a type int, the value of which the application shall
>    ensure is a character representable as an unsigned char or equal to the
>    value of the macro EOF. If the argument has any other value, the behavior
>    is undefined."
> 
> The term "character" is defined here [2]:
>   "A sequence of one or more bytes representing a single graphic symbol or
>    control code."
> 
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
> 
> For invalid values "the behavior is undefined." You were expecting a value 0.
> 
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
>   - the behaviour for valid arguments is different on different platforms, or
>   - the boundary between valid and invalid arguments is fuzzy and depends on
>     the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.


Thanks for all that info. I agree iscntrl() behavior on macOS is within spec,
though is still surprising, and different from other systems.
I agree docs should be as succinct as possible, though...

>> I've also attached an alternative patch for df (in your name).
> 
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).

... It might be worth mentioning this subtle point in the c_iscntrl() docs?
"Note this identifies all single byte control chars even in multibyte encodings".

> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().

I was mainly worried here about \n for scripts to robustly parse df output.

cheers,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 17:02:02 GMT) Full text and rfc822 format available.

Message #35 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Chih-Hsuan Yen <yan12125 <at> gmail.com>, 32236 <at> debbugs.gnu.org,
 bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 10:01:09 -0700
Pádraig Brady wrote:
> I did want to only avoid \n etc. that might cause issues for
> programs that parsed output from df on a line by line basis.
> This subset of control characters is safe to identify
> It seems problematic to start eliding improperly encoded
> mount points for example, rather than just outputting
> what's there.

Yes, I suppose you're right, it's not df's job to police encodings.

> Also just incrementing width++ per each wide character
> doesn't seem right, though again I've not tested it.

True as well. OK, please ignore my patch.

I was prompted by worries about multibyte encodings that use bytes that could be 
misinterpreted as ASCII control characters, such as a locale that uses EBCDIC 
encoding. However, that's probably just a theoretical concern; no coreutils 
users use EBCDIC any more, right? Plus there are doubtless lots of other places 
in coreutils that assume '\n' is a newline in encoded text.




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 21:36:01 GMT) Full text and rfc822 format available.

Message #38 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: bug-gnulib <at> gnu.org
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, Paul Eggert <eggert <at> cs.ucla.edu>,
 Pádraig Brady <P <at> draigbrady.com>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 23:35:21 +0200
Pádraig Brady wrote:
> but I did want to only avoid \n etc. that might cause issues for
> programs that parsed output from df on a line by line basis.

The current code (which uses iscntrl) also catches escape sequences
that can cause weird output on the screen, in a terminal emulator.
This is good (because it can confuse a human reader as much as a '\n'
would confuse a line-by-line parser).

Now, this feature currently only works for escape sequences that
start with an ASCII escape U+001B. It would be useful also for
other control characters to be caught, at least:
  * escape characters U+009B.
  * other characters that cause a newline in a terminal emulator:
    U+2028 and U+2029.
For example, in konsole, the escape sequence '\u009bf' repositions
the cursor. So the effect of

$ mkdir /tmp/`printf 'abc\u009bf'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop0                1986048    1986048          0  100% /tmp/abc�f

is that 'df' produces a U+FFFD. This is less useful than what
it produces for an ASCII escape:

$ mkdir /tmp/`printf 'abc\u001b[2J'`
$ sudo mount -r -t iso9660 -o loop /some/iso/image.iso /tmp/abc*
$ df
...
/dev/loop0                 692828     692828          0  100% /tmp/abc?[2J
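
As an illustration of the list above, a small code-point test covering
exactly the characters named in this message. The function names are
hypothetical, and the set is only the "at least" set mentioned here, not
a complete definition of control characters:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if code point CP is one of the characters named above as able to
   disturb terminal output: the ASCII controls (including ESC, U+001B),
   the escape character U+009B, and the separators U+2028 and U+2029.  */
bool
is_terminal_hazard (uint32_t cp)
{
  return cp <= 0x1F || cp == 0x7F
         || cp == 0x9B
         || cp == 0x2028 || cp == 0x2029;
}

/* Count the hazardous code points among CPS[0..N-1].  */
size_t
count_terminal_hazards (const uint32_t *cps, size_t n)
{
  assert (cps != NULL || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (is_terminal_hazard (cps[i]))
      count++;
  return count;
}
```

A test on decoded code points like this only makes sense after the
multibyte string has been decoded; it cannot be applied byte by byte.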

Bruno





Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 22 Jul 2018 21:41:02 GMT) Full text and rfc822 format available.

Message #41 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, bug-gnulib <at> gnu.org,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 23:40:35 +0200
Pádraig Brady wrote:
> > This patch is correct (because the characters that you test for in c_iscntrl
> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
> 
> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
> "Note this identifies all single byte control chars even in multibyte encodings".

Only in the multibyte encodings that are currently in use. We never know what
kinds of features or misfeatures new multibyte encodings will come up with:
Before GB18030 was introduced, it was a common feature of all multibyte encodings
(including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
second or later byte in a multibyte character. Well, GB18030 broke this assumption.

So, it is dangerous to rely on this property. Therefore I wouldn't like to
document it in the c_iscntrl() documentation.
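
To make the GB18030 point concrete: its four-byte sequences use ASCII
digits (0x30..0x39) as second and fourth bytes, which fall inside
0x00..0x3F but outside the control range 0x00..0x1F. The helper names
below are hypothetical, and only the byte ranges are checked:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the four bytes at P have the shape of a GB18030 four-byte
   sequence: lead 0x81..0xFE, digit 0x30..0x39, lead, digit.  Only the
   byte ranges are checked, not whether the code is assigned.  */
bool
gb18030_four_byte_shape (const unsigned char *p)
{
  assert (p != NULL);
  return 0x81 <= p[0] && p[0] <= 0xFE
         && 0x30 <= p[1] && p[1] <= 0x39
         && 0x81 <= p[2] && p[2] <= 0xFE
         && 0x30 <= p[3] && p[3] <= 0x39;
}

/* True if any of the LEN bytes at P lies in LO..HI inclusive.  */
bool
has_byte_in_range (const unsigned char *p, size_t len,
                   unsigned char lo, unsigned char hi)
{
  assert (p != NULL);
  for (size_t i = 0; i < len; i++)
    if (lo <= p[i] && p[i] <= hi)
      return true;
  return false;
}
```

For the sequence 81 30 81 30, the range test succeeds for 0x00..0x3F but
fails for 0x00..0x1F, which is why a byte-wise control check still works
for the encodings in use today while the older, broader assumption does
not.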

Bruno





Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Wed, 25 Jul 2018 15:52:01 GMT) Full text and rfc822 format available.

Message #44 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-gnulib <bug-gnulib <at> gnu.org>,
 Pádraig Brady <P <at> draigbrady.com>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Wed, 25 Jul 2018 23:51:13 +0800
2018-07-23 5:40 GMT+08:00 Bruno Haible <bruno <at> clisp.org>:
> Pádraig Brady wrote:
>> > This patch is correct (because the characters that you test for in c_iscntrl
>> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
>> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
>>
>> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
>> "Note this identifies all single byte control chars even in multibyte encodings".
>
> Only in the multibyte encodings that are currently in use. We never know what
> kinds of features or misfeatures new multibyte encodings will come up with:
> Before GB18030 was introduced, it was a common feature of all multibyte encodings
> (including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
> second or later byte in a multibyte character. Well, GB18030 broke this assumption.
>
> So, it is dangerous to rely on this property. Therefore I wouldn't like to
> document it in the c_iscntrl() documentation.
>
> Bruno
>

Hello, is there any update on this? Discussions about encodings are beyond
my knowledge, yet I can tell that it's difficult to filter control
characters correctly. How about following Pádraig Brady's idea
and filtering \n only?




Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Thu, 26 Jul 2018 09:03:02 GMT) Full text and rfc822 format available.

Message #47 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Chih-Hsuan Yen <yan12125 <at> gmail.com>, Bruno Haible <bruno <at> clisp.org>
Cc: bug-gnulib <bug-gnulib <at> gnu.org>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Thu, 26 Jul 2018 02:01:53 -0700
Chih-Hsuan Yen wrote:
> How about following the idea from Pádraig Brady
> and filter \n only?

Given the later comments it seems better to filter out encoding errors and 
control characters. Programs that parse the output already cannot trust the 
strings to be exactly right, since newlines are gonna get replaced no matter 
what. So there seems little benefit to copying the other garbage faithfully.

Revised proposed patch(es) attached.
[0001-df-avoid-multibyte-character-corruption-on-macOS.patch (text/x-patch, attachment)]
[0002-df-tune-slightly.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Thu, 26 Jul 2018 10:10:02 GMT) Full text and rfc822 format available.

Message #50 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, bug-gnulib <bug-gnulib <at> gnu.org>,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Thu, 26 Jul 2018 12:09:51 +0200
Paul Eggert wrote:
> Revised proposed patch(es) attached.

Looks good to me, except for one little thing:

>           memcpy (dst, src, n);

src and dst may overlap. Therefore memmove should be used instead of memcpy.

Bruno
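
The memcpy/memmove distinction can be sketched as follows. This is an illustration only, not the code from the patch: the function name drop_prefix_in_place is hypothetical, and it simply shows an in-place copy whose source and destination regions overlap.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: delete the first K bytes of the string S in place.
   The source (S + K) and destination (S) regions overlap whenever the
   remaining tail is longer than K bytes.  memcpy has undefined behavior
   for overlapping regions; memmove is specified to handle them.  */
void
drop_prefix_in_place (char *s, size_t k)
{
  size_t len = strlen (s);
  assert (k <= len);
  /* Move the tail, including the terminating NUL.  */
  memmove (s, s + k, len - k + 1);
}
```

Any filter that compacts a buffer in place, writing kept characters back toward the start of the same buffer, is in the same situation.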





Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Thu, 26 Jul 2018 17:35:02 GMT) Full text and rfc822 format available.

Message #53 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Chih-Hsuan Yen <yan12125 <at> gmail.com>,
 Bruno Haible <bruno <at> clisp.org>
Cc: bug-gnulib <bug-gnulib <at> gnu.org>, 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Thu, 26 Jul 2018 10:34:47 -0700
On 26/07/18 02:01, Paul Eggert wrote:
> Chih-Hsuan Yen wrote:
>> How about following the idea from Pádraig Brady
>> and filter \n only?
> 
> Given the later comments it seems better to filter out encoding errors and 
> control characters. Programs that parse the output already cannot trust the 
> strings to be exactly right, since newlines are gonna get replaced no matter 
> what. So there seems little benefit to copying the other garbage faithfully.
> 
> Revised proposed patch(es) attached.

This is better, though this means that mount points now
need to match the locale of df or they won't be displayed.
Theoretically that was the case previously, but only for control chars,
and so it wouldn't have had a practical impact for mounts
encoded in another locale, only for security/robustness reasons where
one might have \n etc.

I've pushed the c_iscntrl patch since it's simplest
and probably most appropriate patch for an existing release.

If you consider the matching encoding issue as a non issue,
then I'm OK with this.

cheers,
Pádraig
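
As a rough illustration of the byte-wise approach, the sketch below replaces only ASCII control bytes and ignores the locale entirely. The function ascii_iscntrl is a stand-in written for this example (gnulib's c_iscntrl lives in c-ctype.h), and hide_ascii_controls is a hypothetical name, not the function in df.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a locale-independent control test: true only for the
   ASCII control characters 0x00..0x1F and 0x7F.  */
bool
ascii_iscntrl (int c)
{
  return (0 <= c && c <= 0x1F) || c == 0x7F;
}

/* Hypothetical byte-wise filter: replace ASCII control bytes with '?'.
   Bytes 0x80..0xFF are never touched, so UTF-8 text passes through
   unchanged even if the platform's iscntrl misclassifies them.  */
void
hide_ascii_controls (char *s)
{
  assert (s != NULL);
  for (; *s; s++)
    if (ascii_iscntrl ((unsigned char) *s))
      *s = '?';
}
```

Because no byte at or above 0x80 is ever rewritten, a string in a different encoding from the current locale also survives, apart from its ASCII control bytes.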




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 27 Jul 2018 01:24:02 GMT) Full text and rfc822 format available.

Notification sent to Chih-Hsuan Yen <yan12125 <at> gmail.com>:
bug acknowledged by developer. (Fri, 27 Jul 2018 01:24:02 GMT) Full text and rfc822 format available.

Message #58 received at 32236-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>,
 Chih-Hsuan Yen <yan12125 <at> gmail.com>, Bruno Haible <bruno <at> clisp.org>
Cc: 32236-done <at> debbugs.gnu.org, bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Thu, 26 Jul 2018 18:23:02 -0700
[Message part 1 (text/plain, inline)]
Pádraig Brady wrote:
> I've pushed the c_iscntrl patch since it's simplest
> and probably most appropriate patch for an existing release.

Yes, that makes sense for a quick patch. However, for the next release I think 
it'd be better to catch encoding errors and multibyte control characters, given 
the problems noted. I installed the attached further patch to try to do this. 
This fixes the problem that Bruno noted, along with two others; my earlier patch 
neglected the possibility that mbrtowc can return 0, and it incorrectly assumed 
wide control characters always have a single-byte representation.

Either way the original bug appears to be fixed, so I'm boldly closing the bug report.
[0001-df-avoid-multibyte-character-corruption-on-macOS.patch (text/x-patch, attachment)]
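
The attachment itself is not reproduced in this log. As a hedged sketch of the general technique described (decode with mbrtowc, replace encoding errors and control characters, compact in place), something along these lines would work. The name replace_problematic_chars and the choice to consume exactly one byte per encoding error are assumptions made for this example, not details taken from the patch.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical sketch, not the coreutils code: rewrite S in place so
   that every encoding error and every control character (single-byte
   or multibyte) becomes one '?'.  Decoding uses the current locale.  */
void
replace_problematic_chars (char *s)
{
  assert (s != NULL);
  char *src = s;
  /* SRCEND excludes the terminating NUL, so mbrtowc never returns 0.  */
  char *srcend = s + strlen (s);
  char *dst = s;
  mbstate_t mbstate;
  memset (&mbstate, 0, sizeof mbstate);

  while (src < srcend)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, src, srcend - src, &mbstate);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: consume one byte and resync.  */
          *dst++ = '?';
          src++;
          memset (&mbstate, 0, sizeof mbstate);
        }
      else if (iswcntrl (wc))
        {
          /* A control character may occupy several bytes; emit one '?'.  */
          *dst++ = '?';
          src += n;
        }
      else
        {
          /* DST never runs ahead of SRC, but the regions can overlap.  */
          memmove (dst, src, n);
          src += n;
          dst += n;
        }
    }
  *dst = '\0';
}
```

A caller would normally run setlocale (LC_ALL, "") first; without it the program stays in the "C" locale, where only ASCII input is portable.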

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Fri, 27 Jul 2018 09:39:02 GMT) Full text and rfc822 format available.

Message #61 received at 32236-done <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, bug-gnulib <bug-gnulib <at> gnu.org>,
 Pádraig Brady <P <at> draigbrady.com>,
 32236-done <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Fri, 27 Jul 2018 11:38:22 +0200
Paul Eggert wrote:
> my earlier patch 
> neglected the possibility that mbrtowc can return 0

I wouldn't see this as a bug: You can assume that mbrtowc returns
0 if and only if the multibyte sequence is a NUL byte - but you had
chosen srcend in such a way that this would not happen in the loop.

> and it incorrectly assumed 
> wide control characters always have a single-byte representation.

Oops, you're right. My mistake as well.

The new patch looks good.

This will catch (and replace with '?') U+2028 and U+2029 on glibc systems.
On macOS, it will not do this, because iswcntrl(0x2028) and iswcntrl(0x2029)
are 0 on this system; this is consistent with the fact that the 'Terminal'
program displays these characters as simple spaces. So, no need to override
iswcntrl on macOS.

Bruno
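
The platform difference described above can be checked with a small probe. This is only a sketch: platform_treats_as_control is a hypothetical name, and the code assumes that wchar_t values are Unicode code points, which holds on glibc and macOS but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical probe: does the platform's iswcntrl classify CODE_POINT
   as a control character in the current locale?  Assumes wchar_t holds
   Unicode code points.  */
bool
platform_treats_as_control (unsigned long code_point)
{
  assert (code_point <= 0x10FFFF);
  return iswcntrl ((wint_t) code_point) != 0;
}
```

To reproduce the observation, call setlocale (LC_ALL, "") under a UTF-8 locale and then probe 0x2028 and 0x2029; according to the message above, the result differs between glibc systems and macOS 10.13.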


2018-07-27  Bruno Haible  <bruno <at> clisp.org>

	iswcntrl: Mention minor problem on macOS.
	* doc/posix-functions/iswcntrl.texi: Mention oddity on macOS.

diff --git a/doc/posix-functions/iswcntrl.texi b/doc/posix-functions/iswcntrl.texi
index 99eaa0e..44dd034 100644
--- a/doc/posix-functions/iswcntrl.texi
+++ b/doc/posix-functions/iswcntrl.texi
@@ -25,4 +25,8 @@ Portability problems not fixed by Gnulib:
 @item
 On AIX and Windows platforms, @code{wchar_t} is a 16-bit type and therefore cannot
 accommodate all Unicode characters.
+@item
+This function returns 0 for U+2028 (LINE SEPARATOR) and
+U+2029 (PARAGRAPH SEPARATOR) on some platforms:
+Mac OS X 10.13.
 @end itemize





Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Fri, 27 Jul 2018 19:06:01 GMT) Full text and rfc822 format available.

Message #64 received at 32236-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, bug-gnulib <bug-gnulib <at> gnu.org>,
 Pádraig Brady <P <at> draigbrady.com>, 32236-done <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Fri, 27 Jul 2018 12:05:17 -0700
[Message part 1 (text/plain, inline)]
Bruno Haible wrote:
> You can assume that mbrtowc returns
> 0 if and only if the multibyte sequence is a NUL byte - but you had
> chosen srcend in such a way that this would not happen in the loop.

Thanks for the correction. I mistakenly thought that C allows multibyte 
encodings in which a null wide character's multibyte representation contains an 
all-bits-zero byte. I installed the attached to omit the unnecessary test.
[0001-df-omit-redundant-comparison.patch (text/x-patch, attachment)]
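
For reference, the mbrtowc return values under discussion can be summarized in a short sketch. The wrapper decode_one is hypothetical and exists only to give each call a fresh conversion state.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: decode one character from the N bytes at P
   using a fresh conversion state, and return mbrtowc's result:
     0            the null wide character was decoded (P starts with NUL)
     1..N         number of bytes consumed by a valid character
     (size_t) -2  incomplete multibyte sequence
     (size_t) -1  encoding error  */
size_t
decode_one (const char *p, size_t n, wchar_t *wc)
{
  assert (p != NULL && wc != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  return mbrtowc (wc, p, n, &state);
}
```

A loop that stops at s + strlen (s) never hands a NUL byte to mbrtowc, so the 0 case cannot arise there and a test for it is redundant.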

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 29 Jul 2018 05:55:01 GMT) Full text and rfc822 format available.

Message #67 received at 32236-done <at> debbugs.gnu.org (full text, mbox):

From: Chih-Hsuan Yen <yan12125 <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Pádraig Brady <P <at> draigbrady.com>,
 bug-gnulib <bug-gnulib <at> gnu.org>, Bruno Haible <bruno <at> clisp.org>,
 32236-done <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 29 Jul 2018 13:53:53 +0800
2018-07-28 3:05 GMT+08:00 Paul Eggert <eggert <at> cs.ucla.edu>:
> Bruno Haible wrote:
>>
>> You can assume that mbrtowc returns
>> 0 if and only if the multibyte sequence is a NUL byte - but you had
>> chosen srcend in such a way that this would not happen in the loop.
>
>
> Thanks for the correction. I mistakenly thought that C allows multibyte
> encodings in which a null wide character's multibyte representation contains
> an all-bits-zero byte. I installed the attached to omit the unnecessary
> test.

Thank you all for the effort! I've installed commit
e5dae2c6b0bcd0e4ac6e5b212688d223e2e62f79 of coreutils, and `df` works
like a charm!

Cheers!

Chih-Hsuan Yen




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 26 Aug 2018 11:24:06 GMT) Full text and rfc822 format available.

bug unarchived. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 03 Mar 2019 22:50:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#32236; Package coreutils. (Sun, 03 Mar 2019 22:55:01 GMT) Full text and rfc822 format available.

Message #74 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: 32236 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, yan12125 <at> gmail.com
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 3 Mar 2019 14:53:56 -0800
[Message part 1 (text/plain, inline)]
On 26/07/18 18:23, Paul Eggert wrote:
> Pádraig Brady wrote:
>> I've pushed the c_iscntrl patch since it's simplest
>> and probably most appropriate patch for an existing release.
> 
> Yes, that makes sense for a quick patch. However, for the next release I think 
> it'd be better to catch encoding errors and multibyte control characters, given 
> the problems noted. I installed the attached further patch to try to do this. 
> This fixes the problem that Bruno noted, along with two others; my earlier patch 
> neglected the possibility that mbrtowc can return 0, and it incorrectly assumed 
> wide control characters always have a single-byte representation.
> 
> Either way the original bug appears to be fixed, so I'm boldly closing the bug report.

Reviewing this, I dislike the way that we're now enforcing that
the file system locale needs to match the current user's locale
or otherwise df will not output all original characters.
That has the potential to break scripts, as mismatched
encodings are a common issue.

In the attached I've taken the original less aggressive replacement
policy when not outputting to a tty, leaving more sanitizing to the tty case.

cheers,
Pádraig
[df-relax-encoding.patch (text/x-patch, attachment)]
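
The attached patch is not shown in this log. As a guess at the general shape of such a policy switch, the sketch below picks a replacement policy from whether the output descriptor is a terminal. The enum and the function choose_policy are hypothetical names invented for this example.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical policies for cleaning up strings before printing.  */
enum sanitize_policy
{
  SANITIZE_RELAXED,   /* replace ASCII control bytes only */
  SANITIZE_STRICT     /* also replace encoding errors and multibyte
                         control characters */
};

/* Use the strict policy only when FD refers to a terminal, so that
   pipes and files receive the original bytes as far as possible.  */
enum sanitize_policy
choose_policy (int fd)
{
  assert (fd >= 0);
  return isatty (fd) ? SANITIZE_STRICT : SANITIZE_RELAXED;
}
```

A program would typically call choose_policy (STDOUT_FILENO) once at startup. The trade-off is that scripts reading through a pipe see names in mismatched encodings unchanged, while an interactive terminal is still protected from stray control sequences.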

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 01 Apr 2019 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 19 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.