GNU bug report logs - #7968
multibyte: 16-bit wchar_t on Windows and Cygwin

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Severity: wishlist; Reported by: Bastien ROUCARIES <roucaries.bastien@HIDDEN>; merged with #7948, #7963; dated Wed, 2 Feb 2011 19:04:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: 16-bit wchar_t on Windows and Cygwin' from 'RE : Re: 16-bit wchar_t on Windows and Cygwin' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Merged 7948 7963 7968. Request was from era eriksson <era@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at submit <at> debbugs.gnu.org:


Date: Wed, 2 Feb 2011 19:53:43 +0100
Message-ID: <AANLkTimd99gqmTwKt_vD=nK6Wvd9fiU4QZf64ATv_-jb@HIDDEN>
Subject: RE : Re: 16-bit wchar_t on Windows and Cygwin
From: Bastien ROUCARIES <roucaries.bastien@HIDDEN>
To: Bruno Haible <bruno@HIDDEN>
Cc: bug-coreutils <bug-coreutils@HIDDEN>, cygwin <cygwin@HIDDEN>,
	bug-gnulib@HIDDEN, Eric Blake <eblake@HIDDEN>

Using -fno-short-wchar would avoid having to change the API.

Bastien

On 2 Feb 2011 at 18:42, "Bruno Haible" <bruno@HIDDEN> wrote:

Hello Corinna,

> And, please note the wording in SUSv4, for instance in
> http://calimero.vinschen.de/susv4/functions/iswalpha.html

Likewise in POSIX:2008, at the URL
http://www.opengroup.org/onlinepubs/9699919799/functions/iswalpha.html

>   The wc argument is a wint_t, the value of which the application shall
>                        ^^^^^^                         ^^^^^^^^^^^
>   ensure is a wide-character code corresponding to a valid character in
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   the current locale, or equal to the value of the macro WEOF. If the
>   argument has any other value, the behavior is undefined.

Expressed as a formula, this sentence means that when an application passes
a 'wint_t x' to iswalpha(), it has to satisfy

  x == (wint_t) (wchar_t) x || x == WEOF
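
The condition above can be sketched in C. Because wchar_t is 32 bits wide on many systems, the sketch below simulates a 16-bit wchar_t with stand-in types; the names wchar16_t, wint32_t, WEOF32 and valid_wctype_arg are hypothetical and exist only for this illustration. It uses WEOF, the sentinel named in the quoted specification text, and it checks only the representability condition, not whether the value is a valid character in the current locale.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a platform with a 16-bit wchar_t
   and a 32-bit wint_t (as described for Windows and Cygwin). */
typedef uint16_t wchar16_t;
typedef uint32_t wint32_t;
#define WEOF32 ((wint32_t) 0xFFFFFFFFu)

/* True if x survives a round trip through the 16-bit wide-character
   type, or is the end-of-file sentinel. */
static bool valid_wctype_arg(wint32_t x)
{
    return x == (wint32_t) (wchar16_t) x || x == WEOF32;
}
```

Under these assumptions a BMP value such as 0x0041 passes, while a UTF-32 value such as 0x1F600 does not, because truncation to 16 bits changes it.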

> iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 byte,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate to UTF-32 before calling iswalpha.

When an application does this, it passes an invalid wint_t value to
iswalpha(), according to the spec paragraph that you have just cited.
So the application uses an extension to POSIX functionality, not
POSIX itself.
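
For reference, the surrogate-to-UTF-32 conversion under discussion is the standard UTF-16 decoding arithmetic. A minimal sketch follows; the function names are hypothetical and are not taken from Cygwin, gnulib, or any other library mentioned in this thread.

```c
#include <stdint.h>

/* High (lead) surrogates occupy U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trail) surrogates occupy U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low surrogate pair into one code point in the
   range U+10000..U+10FFFF.  The caller must have checked both units. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u
           + (((uint32_t) hi - 0xD800u) << 10)
           + ((uint32_t) lo - 0xDC00u);
}
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600. On a platform with a 16-bit wchar_t, that result no longer fits in wchar_t, which is the point of the objection above.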

I see that Cygwin 1.7.x iswalpha() works in the way you describe (but
mingw's iswalpha() doesn't). So this means that gnulib's proposed
iswwalpha(wwchar_t) function could be implemented using iswalpha()
on Cygwin 1.7.x and will not cause the Unicode based tables to be
included in the executable. This is good and nice.

But if you say that the application should convert UTF-16 surrogates
to UTF-32 before calling iswalpha: That's certainly a requirement
for Cygwin 1.7.x applications that want to support the entire Unicode
character set. But it's outside of POSIX, and many GNU programs will
not want to include this added complexity. Just try to apply this
suggestion to gnulib's quotearg.c, then estimate the time someone
would need to apply it also to regcomp.c, strftime.c, mbscasestr.c,
coreutils/src/wc.c, and so on.

For this reason I propose the wwchar_t type with an API that is similar
to POSIX <wctype.h> but includes the surrogate handling, rather than
pushing it into each application's code.
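
To make the idea concrete, here is a rough sketch of what such an API might look like. It is not the gnulib proposal itself: only the names wwchar_t and iswwalpha come from the message above, while the 32-bit representation, wwc_next, count_chars, and the choice to map unpaired surrogates to U+FFFD are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Assumed representation: one full Unicode code point per wwchar_t. */
typedef uint32_t wwchar_t;

/* Hypothetical helper: decode one character from the UTF-16 units
   s[0..n-1] (n >= 1), store it in *out, and return the number of units
   consumed (1 or 2).  An unpaired surrogate yields U+FFFD. */
static size_t wwc_next(const uint16_t *s, size_t n, wwchar_t *out)
{
    uint16_t u = s[0];
    if (u >= 0xD800 && u <= 0xDBFF
        && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *out = 0x10000u + (((wwchar_t) u - 0xD800u) << 10)
               + ((wwchar_t) s[1] - 0xDC00u);
        return 2;
    }
    *out = (u >= 0xD800 && u <= 0xDFFF) ? 0xFFFDu : u;
    return 1;
}

/* Sketch of iswwalpha: defer to iswalpha() when the value fits in
   wchar_t.  A real implementation would need its own tables (or a
   platform extension) for values above WCHAR_MAX; here they simply
   report "not alphabetic". */
static int iswwalpha(wwchar_t wc)
{
    if (wc <= (wwchar_t) WCHAR_MAX)
        return iswalpha((wint_t) wc);
    return 0;
}

/* Application code then counts characters without ever seeing a
   surrogate, much as a character-counting tool would. */
static size_t count_chars(const uint16_t *s, size_t n)
{
    size_t count = 0;
    while (n > 0) {
        wwchar_t wc;
        size_t used = wwc_next(s, n, &wc);
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

With this shape, the surrogate logic lives in one place (wwc_next) and callers such as count_chars stay as simple as their wchar_t-based equivalents.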


Bruno
-- 
In memoriam Carl Friedrich Goerdeler <
http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdel...



Acknowledgement sent to Bastien ROUCARIES <roucaries.bastien@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#7968; Package coreutils. Full text available.
Last modified: Fri, 19 Oct 2018 16:45:01 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.