GNU bug report logs - #7948
multibyte: 16-bit wchar_t on Windows and Cygwin


Package: coreutils

Reported by: Eric Blake <eblake <at> redhat.com>

Date: Mon, 31 Jan 2011 16:51:02 UTC

Severity: wishlist

Merged with 7963, 7968

To reply to this bug, email your comments to 7948 AT debbugs.gnu.org.



Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Mon, 31 Jan 2011 16:51:02 GMT)

Acknowledgement sent to Eric Blake <eblake <at> redhat.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 31 Jan 2011 16:51:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-coreutils <bug-coreutils <at> gnu.org>, cygwin <cygwin <at> cygwin.com>,
	bug-gnulib <at> gnu.org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Mon, 31 Jan 2011 09:58:19 -0700
[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
> 
> It has been known for a long time that on native Windows, the wchar_t[]
> encoding of strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that
> it is the same for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At that point, the POSIX definitions of those functions no
longer apply, and we can try to make the various wc* functions
behave as smartly as possible (as is the case with Cygwin), where those
smarts are only needed when you use surrogate pairs.  If Cygwin's
approach is correct, then maybe the thing to do is codify those smarts
for all implementations with 16-bit wchar_t as an extension to POSIX
that all gnulib clients can rely on, and thus minimize the #ifdefs in
such clients.

> What consequences does this have?
> 
>   1) All code that uses the functions from <wctype.h> (wide character
>      classification and mapping) or wcwidth() malfunctions on strings that
>      contain Unicode characters outside the BMP, i.e. outside the range
>      U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

>   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>      On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
>      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>      output bytes when it consumes a wchar_t.

>   Now with a Chinese character outside the BMP:
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
> 
>   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> 
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
>   So both the number of characters and the number of words are counted
>   wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?
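
For illustration, here is a minimal sketch of why the character count grows by one per non-BMP character when every 16-bit unit is counted separately. It assumes the input has already been converted to UTF-16, where U+21234 (the character in the examples above) becomes the pair 0xD844 0xDE34. The function name utf16_count_code_points is hypothetical, and this is not how wc is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the code points in a UTF-16 buffer of n units.  Every unit
   counts as one character, except a low surrogate (0xDC00..0xDFFF)
   that directly follows a high surrogate (0xD800..0xDBFF): that unit
   is the second half of a character already counted.  */
static size_t
utf16_count_code_points (const uint16_t *s, size_t n)
{
  assert (s != NULL || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    {
      int pair_tail = i > 0
                      && s[i] >= 0xDC00 && s[i] <= 0xDFFF
                      && s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF;
      if (!pair_tail)
        count++;
    }
  return count;
}
```

With the first example ('a', 0xD844, 0xDE34, 'b', newline) the buffer holds 5 units but 4 code points, which is consistent with the 5 versus 4 reported above.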

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
> 
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations), although without compiler support for u"" and U""
(and even u8""), we are no better than ditching compiler support for L""
if you force a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>
1 The header <uchar.h> declares types and functions for manipulating
  Unicode characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19);
      char16_t
  which is an unsigned integer type used for 16-bit characters and is
  the same type as uint_least16_t (described in 7.20.1.2); and
      char32_t
  which is an unsigned integer type used for 32-bit characters and is
  the same type as uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).
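
A minimal sketch of how a caller might use mbrtoc32 from the draft <uchar.h>, assuming a C library that already provides that header. The helper name decode_to_c32 is hypothetical; the input is interpreted in the current locale's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Decode at most max characters of the multibyte string s into out.
   Returns the number of char32_t values stored, or (size_t) -1 if the
   input is invalid or ends in the middle of a character.  */
static size_t
decode_to_c32 (const char *s, char32_t *out, size_t max)
{
  assert (s != NULL && out != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t remaining = strlen (s);
  size_t count = 0;
  while (remaining > 0 && count < max)
    {
      char32_t c;
      size_t r = mbrtoc32 (&c, s, remaining, &state);
      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;        /* invalid or incomplete sequence */
      if (r == (size_t) -3)
        {
          out[count++] = c;        /* stored earlier, no bytes consumed */
          continue;
        }
      if (r == 0)
        break;                     /* decoded the terminating null */
      out[count++] = c;
      s += r;
      remaining -= r;
    }
  return count;
}
```

Because the result type is always 32 bits wide, the caller never sees surrogates, whatever the size of wchar_t on the host.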

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Wed, 02 Feb 2011 14:24:01 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: Eric Blake <eblake <at> redhat.com>
Cc: bug-coreutils <bug-coreutils <at> gnu.org>, cygwin <cygwin <at> cygwin.com>,
	bug-gnulib <at> gnu.org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 12:29:03 +0100
Hello Eric,

> ... POSIX requires that 1 wchar_t corresponds to 1 character
> ...
> > What consequences does this have?
> > 
> >   1) All code that uses the functions from <wctype.h> (wide character
> >      classification and mapping) or wcwidth() malfunctions on strings that
> >      contain Unicode characters outside the BMP, i.e. outside the range
> >      U+0000..U+FFFF.
> 
> Not necessarily.  Such code falls outside of POSIX, but it may still be
> a well-behaved extension if given sane behavior for how to deal with
> surrogates.

No. Code that uses <wctype.h> and wcwidth() is written precisely according
to POSIX. The problem is that this code cannot work correctly when wchar_t[]
is in UTF-16 encoding. There simply is no way to define these functions
in a reasonable way for surrogates.

For example:
  U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
  U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
  U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
  U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
  U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
  U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
There is no way that a system can provide this information through a
function 'iswalpha' that takes a single wchar_t argument.
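
The surrogate arithmetic behind the table above can be sketched as follows. This is a minimal illustration, not code from gnulib or Cygwin, and the helper names are hypothetical. It shows that the low unit 0xDF1E alone does not identify a character: the result depends on the high unit that precedes it.

```c
#include <assert.h>
#include <stdint.h>

/* True if u is a UTF-16 high (leading) surrogate.  */
static int
is_high_surrogate (uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* True if u is a UTF-16 low (trailing) surrogate.  */
static int
is_low_surrogate (uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a surrogate pair into the code point it encodes:
   10 bits come from the high unit, 10 bits from the low unit,
   and the result is offset by 0x10000.  */
static uint32_t
combine_surrogates (uint16_t hi, uint16_t lo)
{
  assert (is_high_surrogate (hi) && is_low_surrogate (lo));
  return 0x10000
         + ((uint32_t) (hi - 0xD800) << 10)
         + (uint32_t) (lo - 0xDC00);
}
```

For example, combine_surrogates (0xD800, 0xDF1E) gives 0x1031E and combine_surrogates (0xD834, 0xDF1E) gives 0x1D31E, matching the first and third rows of the table.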

It would be possible to provide this information
  - either through a function iswalpha2 (wchar_t wc1, wchar_t wc2)
    that takes two wchar_t arguments,
  - or through a function uc_is_alpha (ucs4_t uc),
but that is not POSIX, and it would require rewriting each and every
piece of code that currently uses <wctype.h> in the POSIX way.

> we can try to make the various wc* functions
> behave as smartly as possible (as is the case with Cygwin), where those
> smarts are only needed when you use surrogate pairs.

The point is that this approach can work fine for mbrtowc() and wcrtomb(),
but it cannot yield a working definition for the <wctype.h> functions and
wcwidth().

> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a Chinese character outside the BMP:
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

We agree that it is a bug. And it is caused by
  - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
  - there is no way to define the <wctype.h> POSIX functions sanely in this
    setting, and
  - coreutils and gnulib make use of the POSIX functions.

Even if coreutils were to use mbsrtowcs instead of repeated use of
mbrtowc, there would be no way for it to produce the correct result
without combining surrogates into entire characters.

> And if we decide that cygwin's extensions are sane, how much harder is
> it to characterize what a program must do to be portable to both 16-bit
> and 32-bit wchar_t if they are guaranteed the same behavior for all
> hosts of the same-size wchar_t?  In other words, would it really require
> that many #ifdefs in coreutils to portably and simultaneously support
> both sizes of wchar_t?

It would require
  1. changing the conversions that use mbrtowc to either convert an
     entire string at once (using mbsrtowcs), or make a second call to
     mbrtowc once the first call to mbrtowc has returned a high
     surrogate;
  2. changing all uses of <wctype.h> and wcwidth() to use different
     functions, either functions that take 2 wchar_t arguments, or
     functions that require the caller to combine the surrogates.

This means lots of logic that goes against the spirit of wchar_t
in ANSI C Amd. 1 and POSIX.

> > I'm more in favour of overriding wchar_t and all functions that depend on it -
> > like we did successfully for the socket functions.
> > 
> > In practice, this would mean that on Windows (both native Windows and
> > Cygwin >= 1.7) the use of a 'wchar_t' module will
> >   - override wchar_t to be 32 bits, like in glibc,
> >   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
> >     corresponding system functions are unusable, the replacements will use the
> >     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).
> ...
> compiler primitives, like L"xyz", which result in 16-bit wchar_t
> arrays, will be unusable

Good point. I agree, then, that overriding wchar_t is better
avoided.

> C1x will be adding compiler support for mandatory char16_t and char32_t
> types for UTF-16 and UTF-32 data, independently of whether wchar_t is
> 16-bit or 32-bit; maybe the better thing is to proactively start
> providing the new interfaces in <uchar.h> that will result from C1x
> adoption (and convert GNU programs to use this rather than wchar_t for
> character operations)
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

A newer draft is at
https://www.opengroup.org/platform/single_unix_specification/uploads/40/23495/n1548.pdf

This is a good point, but would have two drawbacks:

  - It throws out the use of a POSIX API for a not-yet-standard API,

  - Performance: For the non-UTF-8 locales (ISO-8859-15, EUC-JP, and
    similar) on platforms like MacOS X, FreeBSD, Solaris, the 'wchar_t'
    representation is essentially a packed multibyte representation,
    which makes mbrtowc() fast because it does not have to do a table
    lookup for the conversion from/to Unicode. If you use mbrtoc32
    instead of mbrtowc, you add extra runtime overhead for a conversion
    to Unicode that would not be necessary when using mbrtowc().

In other words, your proposal would solve the Windows wchar_t problem,
but at the price of a performance penalty on traditional Unix systems.

Here's a new proposal:
  - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
    on Windows platforms and to 'wchar_t' otherwise.
  - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
    Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
    'wcwidth' on most platforms, and a use of libunistring modules on
    Windows platforms.

With this proposal,

  - The code that uses <wctype.h> has to be changed, but in a trivial
    way that introduces no complicated logic: Just change 'w' to 'ww'.
    Not more difficult than, say, using strtoll() instead of strtol().

  - The runtime penalty on non-Windows systems is minimal.

  - On Windows platforms, surrogates are handled correctly, and
    code that uses wchar_t or <windows.h> is left alone.
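
A rough sketch of what such a header could look like. The names wwchar_t, mbrtowwc and iswwalpha come from the proposal above; everything else, including the platform test and the example caller count_alpha, is an assumption made for illustration. On Windows platforms only prototypes are given here, since the libunistring-based implementations are out of scope for this sketch; wwcwidth would follow the same pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

#if defined _WIN32 || defined __CYGWIN__
/* 16-bit wchar_t: use a separate 32-bit type.  The functions would be
   implemented elsewhere (assumed: on top of libunistring).  */
typedef uint32_t wwchar_t;
size_t mbrtowwc (wwchar_t *pwc, const char *s, size_t n, mbstate_t *ps);
int iswwalpha (wwchar_t wc);
#else
/* wchar_t already holds a whole character: trivial redirection.  */
typedef wchar_t wwchar_t;

static inline size_t
mbrtowwc (wwchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  return mbrtowc (pwc, s, n, ps);
}

static inline int
iswwalpha (wwchar_t wc)
{
  return iswalpha ((wint_t) wc);
}
#endif

/* Example caller: count the alphabetic characters in a multibyte
   string.  Compared with the wchar_t version, only 'w' became 'ww'.  */
static size_t
count_alpha (const char *s)
{
  assert (s != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t n = strlen (s);
  size_t count = 0;
  while (n > 0)
    {
      wwchar_t wc;
      size_t r = mbrtowwc (&wc, s, n, &state);
      if (r == (size_t) -1 || r == (size_t) -2 || r == 0)
        break;                  /* stop at invalid input or end */
      if (iswwalpha (wc))
        count++;
      s += r;
      n -= r;
    }
  return count;
}
```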

How does that sound? Comments?

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Wed, 02 Feb 2011 17:44:02 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-gnulib <at> gnu.org, cygwin <cygwin <at> cygwin.com>,
	bug-coreutils <bug-coreutils <at> gnu.org>, Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 02 Feb 2011 09:51:54 -0800
On 02/02/11 03:29, Bruno Haible wrote:
>   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>     on Windows platforms and to 'wchar_t' otherwise.

As a minor point, would it be OK to call this type
'xchar_t' instead?  'x' is the successor to 'w', after all,
and it can be thought of as an abbreviation for 'eXtended'.

A problem with the 'ww' prefix is that mentally I start thinking
"World Wide ..."




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Wed, 02 Feb 2011 18:50:03 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-gnulib <at> gnu.org, cygwin <cygwin <at> cygwin.com>,
	bug-coreutils <bug-coreutils <at> gnu.org>, Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 19:57:06 +0100
Hi Paul,

> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
> >     on Windows platforms and to 'wchar_t' otherwise.
> 
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.

'wwchar_t' means "wide wide character".

In fact it's not really an "extended" character or "complex character".
It's just what POSIX calls a 'wchar_t'.

I like the analogy between strtol and strtoll. In the beginning, people
thought a 'long int' would be enough for everything. Then they discovered
a 'long long int' is needed. The same story repeats itself here with
the "wide characters" which turn out to be not wide enough, and
"wide wide characters" are needed.

> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."

Indeed this meaning can come to mind, but I think it's not dangerous
since the term "world wide" has no meaning in a programming language.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Wed, 02 Feb 2011 20:36:01 GMT)

Message #17 received at submit <at> debbugs.gnu.org:

From: Andy Koppe <andy.koppe <at> gmail.com>
To: cygwin <at> cygwin.com
Cc: bug-gnulib <at> gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
	Eric Blake <eblake <at> redhat.com>, bug-coreutils <bug-coreutils <at> gnu.org>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 20:43:17 +0000
On 2 February 2011 18:57, Bruno Haible wrote:
> Hi Paul,
>
>> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>> >     on Windows platforms and to 'wchar_t' otherwise.
>>
>> As a minor point, would it be OK to call this type
>> 'xchar_t' instead?  'x' is the successor to 'w', after all,
>> and it can be thought of as an abbreviation for 'eXtended'.
>
> 'wwchar_t' means "wide wide character".
>
> In fact it's not really an "extended" character or "complex character".
> It's just what POSIX calls a 'wchar_t'.

It's extended in the sense that the original Unicode was only 16 bits
wide (which of course is why wchar_t on Windows is 16 bits). Also, I
think 'xchar_t' is less prone to typos, in particular forgetting one
of the dubyas.

Andy




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7948; Package coreutils. (Thu, 03 Feb 2011 12:50:05 GMT)

Message #20 received at submit <at> debbugs.gnu.org:

From: Ulf Zibis <Ulf.Zibis <at> gmx.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <bug-coreutils <at> gnu.org>, cygwin <cygwin <at> cygwin.com>,
	bug-gnulib <at> gnu.org, Bruno Haible <bruno <at> clisp.org>,
	Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Thu, 03 Feb 2011 13:57:30 +0100
Hi,

I think a similar bug is currently under discussion:
bug#7960: [PATCH] fmt: fix formatting multibyte text (bug #7372)

-Ulf


On 02.02.2011 18:51, Paul Eggert wrote:
> On 02/02/11 03:29, Bruno Haible wrote:
>>    - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>>      on Windows platforms and to 'wchar_t' otherwise.
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.
>
> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."
>
>
>
>




Merged 7948 7963 7968. Request was from era eriksson <era <at> iki.fi> to control <at> debbugs.gnu.org. (Thu, 30 Aug 2012 08:08:02 GMT)

Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 19 Oct 2018 16:42:02 GMT)

Changed bug title to 'multibyte: 16-bit wchar_t on Windows and Cygwin' from '16-bit wchar_t on Windows and Cygwin' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 19 Oct 2018 16:42:02 GMT)




GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.