GNU bug report logs - #24924
multibyte: pr has no concept of wide characters


Package: coreutils;

Reported by: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>

Date: Fri, 11 Nov 2016 16:12:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 24924 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Fri, 11 Nov 2016 16:12:01 GMT)

Acknowledgement sent to 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 11 Nov 2016 16:12:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: bug-coreutils <at> gnu.org
Subject: pr has no concept of wide characters
Date: Sat, 12 Nov 2016 00:10:46 +0800
The pr documentation (man, info) doesn't mention how it has no concept
of wide characters.

$ pr -m --sep-string='^^^'  file file

2016-11-12 00:06                                                  Page 1


<!DOCTYPE HTML PUBLIC "-//W3C//DTD^^^<!DOCTYPE HTML PUBLIC "-//W3C//DTD
"http://www.w3.org/TR/html4/strict^^^"http://www.w3.org/TR/html4/strict
<html lang="zh-tw">               ^^^<html lang="zh-tw">
<head>                            ^^^<head>
 <meta http-equiv="Content-Type" c^^^ <meta http-equiv="Content-Type" c
 "text/html; charset=utf-8">      ^^^ "text/html; charset=utf-8">
 <meta name="viewport" content="wi^^^ <meta name="viewport" content="wi
 <title>My groups ordered by ...</^^^ <title>My groups ordered by ...</
 <base href="https://www.facebook.^^^ <base href="https://www.facebook.
</head>                           ^^^</head>
<body>                            ^^^<body>
 <dl>                             ^^^ <dl>
  <dt>"同志|Queer|Gdi"</dt>               ^^^  <dt>"同志|Queer|Gdi"</dt>
   <dd>  5 o 台灣同志遊行聯盟 Taiwan LGBT Pride Co^^^   <dd>  5 o 台灣同志遊行聯盟 Taiwan LGBT Pride Co
   <dd>  0 o 台灣同志交友聯盟 301797916498866<BR> ^^^   <dd>  0 o 台灣同志交友聯盟 301797916498866<BR>
   <dd> 25 o 我是(直)同志,我很驕傲! 185779952675<BR> ^^^       <dd> 25 o 我是(直)同志,我很驕傲! 185779952675<BR>
   <dd> 25 o 台灣酷兒權益推動聯盟 Taiwan Gender Queer ^^^       <dd> 25 o 台灣酷兒權益推動聯盟 Taiwan Gender Queer
  <dt>"性別|蝶園" BUT NOT "TV"</dt>       ^^^  <dt>"性別|蝶園" BUT NOT "TV"</dt>
   <dd>  0 c 跨性別與女性主義 Transgender&amp;Femi^^^   <dd>  0 c 跨性別與女性主義 Transgender&amp;Femi
   <dd>  2 c 中部性別團體聯盟 293589073985313<BR> ^^^   <dd>  2 c 中部性別團體聯盟 293589073985313<BR>
   <dd>  1 o 台灣TG蝶園 320448571355058<BR^^^   <dd>  1 o 台灣TG蝶園 320448571355058<BR
   <dd>  0 o 中華民國跨性別者生活權益促進合作社訊息發布站 252346365161476<BR> ^^^       <dd>  0 o 中華民國跨性別者生活權益促進合作社訊息發布站 252346365161476<BR>
   <dd>  3 o 性別不明關懷協會(Beyond Gender) 17160^^^   <dd>  3 o 性別不明關懷協會(Beyond Gender) 17160
   <dd>  0 o 偽百合與偽娘、跨性別們的哲學、思想交流社群 810661859077873<BR> ^^^ <dd>  0 o 偽百合與偽娘、跨性別們的哲學、思想交流社群 810661859077873<BR>
$ pr --version
pr (GNU coreutils) 8.25
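
The misalignment above comes down to three different ways of measuring the same text: bytes, code points, and terminal columns. The following is a minimal sketch of that difference, not anything pr itself does. The helper name display_width is hypothetical, and it assumes that East Asian Wide/Fullwidth characters occupy two columns and combining marks none, which is only an approximation of what a terminal renders.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed for text.

    Assumption: East Asian Wide (W) and Fullwidth (F) characters take two
    columns, combining marks take none, everything else takes one.
    """
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base character
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols


sample = "台灣TG蝶園"  # fragment taken from the report's input
# UTF-8 bytes, code points, approximate columns
print(len(sample.encode("utf-8")), len(sample), display_width(sample))
# prints: 14 6 10
```

A tool that counts any one of these when it should be counting another will pad or truncate columns at the wrong place.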




Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Fri, 11 Nov 2016 16:37:02 GMT)

Message #8 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: pr has no concept of wide characters
Date: Fri, 11 Nov 2016 11:36:20 -0500
severity 24924 wishlist
tags 24924 wishlist notabug
thanks

Hello Dan,

On 11/11/2016 11:10 AM, 積丹尼 Dan Jacobson wrote:
> The pr documentation (man, info) doesn't mention how it has no concept
> of wide characters.
> $ pr -m --sep-string='^^^'  file file

Indeed, most of the current coreutils programs do not support wide or multi-byte characters correctly.
The current official implementation does not support them (which is why I marked this item as 'wishlist' and not a bug).
On Red Hat systems, there is the 'i18n' patch, which adds some support but also introduces some problematic issues:
  https://github.com/pixelb/coreutils/tree/i18n

However, there is an active effort to make all of them multibyte aware.
The latest updates are (in reverse chronological order, these are somewhat long threads):
  http://lists.gnu.org/archive/html/coreutils/2016-09/msg00026.html
  http://lists.gnu.org/archive/html/coreutils/2016-09/msg00011.html
  http://lists.gnu.org/archive/html/coreutils/2016-07/msg00013.html

'cut' and 'expand' were the first two programs I worked on.
'pr' is definitely on the list - once I have a proof-of-concept working, I would very much appreciate if you could help me test it as there are many edge-cases with multibyte support and wide-characters.

As a curiosity,
are you using UTF-8 locales exclusively, or do you have experience with Shift-JIS or EUC-JP locales?


I'm leaving this ticket open, and welcome discussion and comments.
regards,
 - assaf


P.S.
The usual disclaimer applies: there is currently no ETA for multibyte support in coreutils.







Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Sat, 12 Nov 2016 10:13:01 GMT)

Message #11 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: pr has no concept of wide characters
Date: Sat, 12 Nov 2016 18:12:36 +0800
>>>>> "AG" == Assaf Gordon <assafgordon <at> gmail.com> writes:

AG> I would very much appreciate if you could help me test it as there
AG> are many edge-cases with multibyte support and wide-characters.

Sure but you need to send me a .deb or
$ which pr|xargs file
/usr/bin/pr: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV),
dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux
2.6.32, BuildID[sha1]=14376d20f6383ec9348da986ecc693c6bb45a0ee, stripped

AG> As a curiosity,
AG> are you using UTF-8 locales exclusively, or do you have experience
AG> with Shift-JIS or EUC-JP locales?

Nope I just use zh_TW.utf8 all the time.




Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Wed, 30 Nov 2016 11:31:01 GMT)

Message #14 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: 24924 <at> debbugs.gnu.org
Subject: GNU pr only working with singlebyte 1-width characters
Date: Wed, 30 Nov 2016 11:30:34 +0000
Only arguing on the classification of this bug here.

Let's call a cat a cat. When something doesn't work as
documented, it's a bug, not a wishlist entry.

AFAICT, there's nothing in the GNU coreutils documentation that
states that pr only works on input that consists exclusively of
single-byte characters that are neither zero-width (though it
copes OK with ASCII BS and TAB) nor double-width (or on
ASCII-only input).

Today, UTF-8 is the most commonly used character set, so it
even affects English text (where £ (the British currency symbol)
is encoded on two bytes in UTF-8 for instance), and even
US-English text like for the ‘quoting characters’ (3 bytes each
in UTF-8) now that ASCII ' has been demoted to just an
apostrophe.
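
The byte counts mentioned above are easy to check. A minimal sketch (the helper name utf8_length is made up for illustration):

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# ASCII apostrophe, pound sign, left and right single quotation marks
for ch in ("'", "£", "‘", "’"):
    print(f"U+{ord(ch):04X} {utf8_length(ch)} byte(s): {ch.encode('utf-8').hex(' ')}")
# prints:
# U+0027 1 byte(s): 27
# U+00A3 2 byte(s): c2 a3
# U+2018 3 byte(s): e2 80 98
# U+2019 3 byte(s): e2 80 99
```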

That can also be seen as a POSIX conformance bug (though GNU
coreutils doesn't claim POSIX conformance, only "The GNU
utilities documented here are /mostly/ compatible with the
POSIX standard").

$ pr -tm --sep-string='|'  <(du --version) <(truncate --version)
du (GNU coreutils) 8.25            |truncate (GNU coreutils) 8.25
Copyright (C) 2016 Free Software Fo|Copyright (C) 2016 Free Software Fo
License GPLv3+: GNU GPL version 3 o|License GPLv3+: GNU GPL version 3 o
This is free software: you are free|This is free software: you are free
There is NO WARRANTY, to the extent|There is NO WARRANTY, to the extent
                                   |
Written by Torbjörn Granlund, David |Written by Pádraig Brady.
and Jim Meyering.                  |

-- 
Stephane




Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Thu, 01 Dec 2016 02:38:02 GMT)

Message #17 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>, 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: GNU pr only working with singlebyte 1-width characters
Date: Wed, 30 Nov 2016 18:37:05 -0800
On 11/30/2016 03:30 AM, Stephane Chazelas wrote:
> That can also be seen as a POSIX conformance bug

Not really, as POSIX does not require support for UTF-8 (except in the 
pax utility, which is not part of coreutils).

It'd be nice if pr etc. could be made to work cleanly for UTF-8. In the 
meantime if you could submit a patch for the documentation that should 
fix the immediate documentation problem.





Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Thu, 01 Dec 2016 06:33:02 GMT)

Message #20 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: GNU pr only working with singlebyte 1-width characters
Date: Thu, 1 Dec 2016 06:32:22 +0000
2016-11-30 18:37:05 -0800, Paul Eggert:
> On 11/30/2016 03:30 AM, Stephane Chazelas wrote:
> >That can also be seen as a POSIX conformance bug
> 
> Not really, as POSIX does not require support for UTF-8 (except in
> the pax utility, which is not part of coreutils).
[...]

POSIX does not require support for any charset. It only
specifies one locale (C/POSIX), doesn't specify the charset in
that locale other than it should be a single byte charset that
covers the portable character set. Examples of such charsets are
ASCII, iso8859-x or EBCDIC. In practice, that tends to be ASCII
(except for some rare EBCDIC based IBM systems) as tha

But it does support a localisation API and allows systems to
support other locales with other charsets. That API does support
multi-byte encodings, including stateful ones (though how they
are /defined/ is implementation defined for locking-shift ones and
in practice those are unworkable so I'd expect those would
eventually be removed from the standard). It doesn't require
compliant systems to have locales with multi-byte character sets,
but if they have (if they show up in the output of locale -a),
then they have to be supported throughout (as specified, for all
the utilities for instance).

Basically, on systems that have locales with multi-byte
encodings --UTF-8 or other-- (most Unix-like ones including GNU
systems like Debian), GNU pr (and many other GNU utilities) is
not POSIX compliant.

See
http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap06.html

for details.

-- 
Stephane




Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Thu, 01 Dec 2016 07:05:01 GMT)

Message #23 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: GNU pr only working with singlebyte 1-width characters
Date: Thu, 1 Dec 2016 07:04:05 +0000
2016-11-30 18:37:05 -0800, Paul Eggert:
[...]
> In the meantime if you could submit a patch for the
> documentation that should fix the immediate documentation
> problem.
[...]

What about:

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index cc85f22..6eb497b 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -1838,6 +1838,12 @@ For single
 column output no line truncation occurs by default.  Use @option{-W} option to
 truncate lines in that case.
 
+Please note that @command{pr} currently doesn't support multi-byte characters
+or non-ASCII characters that have a null or double width. If such characters
+occur in the input or column separators, column alignment may be off or lines
+may exceed the page width. There is also no provision to support bidirectional
+text.
+
 The following changes were made in version 1.22i and apply to later
 versions of @command{pr}:
 @c FIXME: this whole section here sounds very awkward to me. I




Information forwarded to bug-coreutils <at> gnu.org:
bug#24924; Package coreutils. (Thu, 01 Dec 2016 08:50:01 GMT)

Message #26 received at 24924 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 24924 <at> debbugs.gnu.org
Subject: Re: bug#24924: GNU pr only working with singlebyte 1-width characters
Date: Thu, 1 Dec 2016 08:49:39 +0000
2016-12-01 07:04:05 +0000, Stephane Chazelas:
> 2016-11-30 18:37:05 -0800, Paul Eggert:
> [...]
> > In the meantime if you could submit a patch for the
> > documentation that should fix the immediate documentation
> > problem.
> [...]
> 
> What about:
[...]
> +Please note that @command{pr} currently doesn't support multi-byte characters
> +or non-ASCII characters that have a null or double width. If such characters
> +occur in the input or column separators, column alignment may be off or lines
> +may exceed the page width. There is also no provision to support bidirectional
> +text.
[...]

Actually, it seems it can also truncate lines in the middle of
some characters, though that seems confined to multibyte
characters whose encoding contains byte values <= 127, like:

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\ue9\ue9\n' | pr -w5 -t2 | hd
00000000  88 6d 88 6d 88 0a                                 |.m.m..|
00000006

See how that third é (0x88 0x6d in BIG5-HKSCS) was truncated in
the middle.

It's as if it was considering all byte values >= 128 as having
zero width in multi-byte locales (and only in multi-byte
locales, that doesn't seem to occur in single-byte ones).
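
That guess can be written down as a toy model. The sketch below is only an illustration of the hypothesis in this message, not pr's actual code. The function name truncate_bytewise is made up, and the column widths (2 for the -w5 two-column case, 35 for the default 72-column page with a one-character separator) are assumptions inferred from the outputs shown in this thread.

```python
def truncate_bytewise(data: bytes, max_cols: int) -> bytes:
    """Truncate data under the hypothesis that every byte >= 0x80 has zero
    width and every other byte has width 1 (a model, not pr's source)."""
    out = bytearray()
    used = 0
    for byte in data:
        width = 0 if byte >= 0x80 else 1
        if used + width > max_cols:
            break  # the next byte would overflow the column
        out.append(byte)
        used += width
    return bytes(out)


# é is 0x88 0x6d in BIG5-HKSCS according to the message above.
e_acute = b"\x88\x6d"
print(truncate_bytewise(e_acute * 3, 2).hex(" "))
# prints: 88 6d 88 6d 88   (the third character is cut in the middle)

# The '#' characters are filler standing in for the part of the line pr cut off.
line = "Written by Torbjörn Granlund, David " + "#" * 10
cut = truncate_bytewise(line.encode("utf-8"), 35).decode("utf-8")
print(len(cut))
# prints: 36   (one character more than the 35-column cell)
```

Under those assumptions the model reproduces both the split character in the hd output here and the one-column overshoot on the "Torbjörn" line in the earlier pr -tm example.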

So maybe:

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index cc85f22..15088ce 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -1838,6 +1838,13 @@ For single
 column output no line truncation occurs by default.  Use @option{-W} option to
 truncate lines in that case.
 
+Please note that @command{pr} currently doesn't support multi-byte characters
+or non-ASCII characters that have a null or double width. If such characters
+occur in the input or column separators, column alignment may be off or lines
+may exceed the page width, or truncation may occur in the middle of some
+characters producing invalid text output. There is also no provision to support
+bidirectional text.
+
 The following changes were made in version 1.22i and apply to later
 versions of @command{pr}:
 @c FIXME: this whole section here sounds very awkward to me. I

-- 
Stephane
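
For contrast with the failure modes listed in the proposed documentation text, here is a minimal sketch of what width-aware column fitting could look like. It is not taken from coreutils or from the i18n patch. The names char_width and fit_to_columns are hypothetical, and the width rule (two columns for East Asian Wide/Fullwidth characters, none for combining marks) is an approximation.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate column width of one character (assumed rule, see above)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fit_to_columns(text: str, max_cols: int) -> str:
    """Truncate text on character boundaries so it fits in max_cols columns,
    then pad with spaces so the result occupies exactly max_cols columns."""
    out = []
    used = 0
    for ch in text:
        width = char_width(ch)
        if used + width > max_cols:
            break  # never emit part of a character
        out.append(ch)
        used += width
    return "".join(out) + " " * (max_cols - used)


print(repr(fit_to_columns("台灣TG蝶園", 7)))
# prints: '台灣TG '   (6 columns of text plus 1 column of padding)
```

Because it works on decoded characters rather than bytes, this approach cannot split a multibyte character, and wide characters that do not fit are dropped and replaced by padding.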




Changed bug title to 'multibyte: pr has no concept of wide characters' from 'pr has no concept of wide characters' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 07:21:01 GMT)

Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 07:21:01 GMT)

This bug report was last modified 5 years and 181 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.