GNU bug report logs - #7948
multibyte: 16-bit wchar_t on Windows and Cygwin


Package: coreutils; Severity: wishlist; Reported by: Eric Blake <eblake@HIDDEN>; merged with #7963, #7968; dated Mon, 31 Jan 2011 16:51:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: 16-bit wchar_t on Windows and Cygwin' from '16-bit wchar_t on Windows and Cygwin'. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
Merged 7948 7963 7968. Request was from era eriksson <era@HIDDEN> to control <at> debbugs.gnu.org.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 3 Feb 2011 12:49:38 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Feb 03 07:49:38 2011
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Pkyd0-0005um-IQ
	for submit <at> debbugs.gnu.org; Thu, 03 Feb 2011 07:49:38 -0500
Received: from eggs.gnu.org ([140.186.70.92])
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <Ulf.Zibis@HIDDEN>) id 1Pkycy-0005uZ-8k
	for submit <at> debbugs.gnu.org; Thu, 03 Feb 2011 07:49:36 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Ulf.Zibis@HIDDEN>) id 1Pkyl9-0001li-Ox
	for submit <at> debbugs.gnu.org; Thu, 03 Feb 2011 07:58:04 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM,
	RCVD_IN_DNSWL_NONE autolearn=unavailable version=3.3.1
Received: from lists.gnu.org ([199.232.76.165]:56281)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Ulf.Zibis@HIDDEN>) id 1Pkyl9-0001le-Mv
	for submit <at> debbugs.gnu.org; Thu, 03 Feb 2011 07:58:03 -0500
Received: from [140.186.70.92] (port=40062 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pkyl8-0000qk-Sc
	for bug-coreutils@HIDDEN; Thu, 03 Feb 2011 07:58:03 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Ulf.Zibis@HIDDEN>) id 1Pkyl7-0001lE-Tn
	for bug-coreutils@HIDDEN; Thu, 03 Feb 2011 07:58:02 -0500
Received: from mailout-de.gmx.net ([213.165.64.22]:52874)
	by eggs.gnu.org with smtp (Exim 4.71)
	(envelope-from <Ulf.Zibis@HIDDEN>) id 1Pkyl7-0001Z1-BX
	for bug-coreutils@HIDDEN; Thu, 03 Feb 2011 07:58:01 -0500
Received: (qmail invoked by alias); 03 Feb 2011 12:57:33 -0000
Received: from dslb-188-100-063-138.pools.arcor-ip.net (EHLO [127.0.0.1])
	[188.100.63.138]
	by mail.gmx.net (mp002) with SMTP; 03 Feb 2011 13:57:33 +0100
X-Authenticated: #3615077
X-Provags-ID: V01U2FsdGVkX1+pUtB5AH2pyv9CZf8IQEfvOrymZyXzCqI31jNtba
	sStP5H3LtfOaeD
Message-ID: <4D4AA63A.50903@HIDDEN>
Date: Thu, 03 Feb 2011 13:57:30 +0100
From: Ulf Zibis <Ulf.Zibis@HIDDEN>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de;
	rv:1.9.2.13) Gecko/20101207 Thunderbird/3.1.7
MIME-Version: 1.0
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
References: <201101310304.42975.bruno@HIDDEN>
	<4D46EA2B.1010307@HIDDEN>	<201102021229.04623.bruno@HIDDEN>
	<4D4999BA.2030100@HIDDEN>
In-Reply-To: <4D4999BA.2030100@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Antivirus: avast! (VPS 110203-1, 03.02.2011), Outbound message
X-Antivirus-Status: Clean
X-Y-GMX-Trusted: 0
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 213.165.64.22
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 2)
X-Received-From: 199.232.76.165
X-Spam-Score: -4.8 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-coreutils <bug-coreutils@HIDDEN>, cygwin <cygwin@HIDDEN>,
	bug-gnulib@HIDDEN, Bruno Haible <bruno@HIDDEN>,
	Eric Blake <eblake@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/pipermail/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
X-Spam-Score: -5.0 (-----)

Hi,

I think a similar bug is currently under discussion on the GNU tracker:
bug#7960: [PATCH] fmt: fix formatting multibyte text (bug #7372)

-Ulf


On 02.02.2011 18:51, Paul Eggert wrote:
> On 02/02/11 03:29, Bruno Haible wrote:
>>    - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>>      on Windows platforms and to 'wchar_t' otherwise.
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.
>
> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."
>
>
>
>




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#7948; Package coreutils.

Message received at submit <at> debbugs.gnu.org:


In-Reply-To: <201102021957.07676.bruno@HIDDEN>
References: <201101310304.42975.bruno@HIDDEN>
	<201102021229.04623.bruno@HIDDEN> <4D4999BA.2030100@HIDDEN>
	<201102021957.07676.bruno@HIDDEN>
Date: Wed, 2 Feb 2011 20:43:17 +0000
Message-ID: <AANLkTikRJxssP7OLr7O+DQZr-BpjpEZJ8Pe2uJ=msDbh@HIDDEN>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
From: Andy Koppe <andy.koppe@HIDDEN>
To: cygwin@HIDDEN
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: bug-gnulib@HIDDEN, Paul Eggert <eggert@HIDDEN>,
	Eric Blake <eblake@HIDDEN>, bug-coreutils <bug-coreutils@HIDDEN>

On 2 February 2011 18:57, Bruno Haible wrote:
> Hi Paul,
>
>> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>> >     on Windows platforms and to 'wchar_t' otherwise.
>>
>> As a minor point, would it be OK to call this type
>> 'xchar_t' instead?  'x' is the successor to 'w', after all,
>> and it can be thought of as an abbreviation for 'eXtended'.
>
> 'wwchar_t' means "wide wide character".
>
> In fact it's not really an "extended" character or "complex character".
> It's just what POSIX calls a 'wchar_t'.

It's extended in the sense that the original Unicode was only 16 bits
wide (which of course is why wchar_t on Windows is 16 bits). Also, I
think 'xchar_t' is less prone to typos, in particular forgetting one
of the dubyas.

Andy





Message received at submit <at> debbugs.gnu.org:


From: Bruno Haible <bruno@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 19:57:06 +0100
User-Agent: KMail/1.9.9
References: <201101310304.42975.bruno@HIDDEN>
	<201102021229.04623.bruno@HIDDEN> <4D4999BA.2030100@HIDDEN>
In-Reply-To: <4D4999BA.2030100@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <201102021957.07676.bruno@HIDDEN>
Cc: bug-gnulib@HIDDEN, cygwin <cygwin@HIDDEN>,
	bug-coreutils <bug-coreutils@HIDDEN>, Eric Blake <eblake@HIDDEN>

Hi Paul,

> >   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
> >     on Windows platforms and to 'wchar_t' otherwise.
> 
> As a minor point, would it be OK to call this type
> 'xchar_t' instead?  'x' is the successor to 'w', after all,
> and it can be thought of as an abbreviation for 'eXtended'.

'wwchar_t' means "wide wide character".

In fact it's not really an "extended" character or "complex character".
It's just what POSIX calls a 'wchar_t'.

I like the analogy between strtol and strtoll. In the beginning, people
thought a 'long int' would be enough for everything. Then they discovered
a 'long long int' is needed. The same story repeats itself here with
the "wide characters" which turn out to be not wide enough, and
"wide wide characters" are needed.

> A problem with the 'ww' prefix is that mentally I start thinking
> "World Wide ..."

Indeed this meaning can come to mind, but I think it's not dangerous
since the term "world wide" has no meaning in a programming language.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>





Message received at submit <at> debbugs.gnu.org:


Message-ID: <4D4999BA.2030100@HIDDEN>
Date: Wed, 02 Feb 2011 09:51:54 -0800
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
	rv:1.9.2.13) Gecko/20101208 Thunderbird/3.1.7
MIME-Version: 1.0
To: Bruno Haible <bruno@HIDDEN>
Subject: Re: bug#7948: 16-bit wchar_t on Windows and Cygwin
References: <201101310304.42975.bruno@HIDDEN> <4D46EA2B.1010307@HIDDEN>
	<201102021229.04623.bruno@HIDDEN>
In-Reply-To: <201102021229.04623.bruno@HIDDEN>
X-Enigmail-Version: 1.1.2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: bug-gnulib@HIDDEN, cygwin <cygwin@HIDDEN>,
	bug-coreutils <bug-coreutils@HIDDEN>, Eric Blake <eblake@HIDDEN>

On 02/02/11 03:29, Bruno Haible wrote:
>   - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
>     on Windows platforms and to 'wchar_t' otherwise.

As a minor point, would it be OK to call this type
'xchar_t' instead?  'x' is the successor to 'w', after all,
and it can be thought of as an abbreviation for 'eXtended'.

A problem with the 'ww' prefix is that mentally I start thinking
"World Wide ..."





Message received at submit <at> debbugs.gnu.org:


From: Bruno Haible <bruno@HIDDEN>
To: Eric Blake <eblake@HIDDEN>
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 12:29:03 +0100
User-Agent: KMail/1.9.9
References: <201101310304.42975.bruno@HIDDEN> <4D46EA2B.1010307@HIDDEN>
In-Reply-To: <4D46EA2B.1010307@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <201102021229.04623.bruno@HIDDEN>
Cc: bug-coreutils <bug-coreutils@HIDDEN>, cygwin <cygwin@HIDDEN>,
	bug-gnulib@HIDDEN

Hello Eric,

> ... POSIX requires that 1 wchar_t corresponds to 1 character
> ...
> > What consequences does this have?
> > 
> >   1) All code that uses the functions from <wctype.h> (wide character
> >      classification and mapping) or wcwidth() malfunctions on strings that
> >      contains Unicode characters outside the BMP, i.e. outside the range
> >      U+0000..U+FFFF.
> 
> Not necessarily.  Such code falls outside of POSIX, but it may still be
> a well-behaved extension if given sane behavior for how to deal with
> surrogates.

No. Code that uses <wctype.h> and wcwidth() is written precisely according
to POSIX. The problem is that this code cannot work correctly when wchar_t[]
is in UTF-16 encoding. There simply is no way to define these functions
in a reasonable way for surrogates.

For example:
  U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
  U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
  U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
  U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
  U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
  U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
There is no way that a system can provide this information through a
function 'iswalpha' that takes a single wchar_t argument.

It would be possible to provide this information
  - either through a function iswalpha2 (wchar_t wc1, wchar_t wc2)
    that takes two wchar_t arguments,
  - or through a function uc_is_alpha (ucs4_t uc),
but that is not POSIX, and it would require rewriting each and every
piece of code that currently uses <wctype.h> in the POSIX way.
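
To make the pairs above concrete, here is a minimal sketch of how a caller
would have to combine two UTF-16 code units before any classification can
happen. The helper names (is_high_surrogate, is_low_surrogate,
combine_surrogates) are illustrative assumptions; they are not part of POSIX,
gnulib, or libunistring.

```c
#include <assert.h>
#include <stdint.h>

/* True if u is a UTF-16 high (lead) surrogate, 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* True if u is a UTF-16 low (trail) surrogate, 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a surrogate pair into the Unicode code point it encodes.
   Each surrogate carries 10 bits; the result is offset by 0x10000. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
  assert(is_high_surrogate(hi) && is_low_surrogate(lo));
  return 0x10000
         + ((uint32_t) (hi - 0xD800) << 10)
         + (uint32_t) (lo - 0xDC00);
}
```

Note that the low surrogate 0xDF1E occurs in three of the pairs listed above,
twice as part of a letter and once not, so no function of a single 16-bit unit
can give the right answer for all of them.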

> we can (try) to make the various wc* functions try to
> behave as smartly as possible (as is the case with Cygwin); where those
> smarts are only needed when you use surrogate pairs.

The point is that this approach can work fine for mbrtowc() and wcrtomb(),
but it cannot yield a working definition for the <wctype.h> functions and
wcwidth().

> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a Chinese character outside the BMP:
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

We agree that it is a bug. It is caused by the combination of three facts:
  - Cygwin's wchar_t[] encoding is UTF-16,
  - there is no way to define the <wctype.h> POSIX functions sanely in this
    setting, and
  - coreutils and gnulib make use of these POSIX functions.

Even if coreutils were to use mbsrtowcs instead of repeated use of
mbrtowc, there would be no way for it to produce the correct result
without combining surrogates into entire characters.
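
As an illustration of what "combining surrogates into entire characters"
demands of the caller, the following sketch counts characters in a UTF-16
buffer. The function name utf16_count_chars is hypothetical, and counting an
unpaired surrogate as one character is an arbitrary choice made for this
sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count characters (code points) in a buffer of n UTF-16 code units.
   A high surrogate followed by a low surrogate counts as one character;
   every other unit, including an unpaired surrogate, counts as one. */
static size_t utf16_count_chars(const uint16_t *s, size_t n)
{
  size_t count = 0;
  size_t i = 0;

  assert(s != NULL || n == 0);
  while (i < n)
    {
      int pair = s[i] >= 0xD800 && s[i] <= 0xDBFF
                 && i + 1 < n
                 && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
      i += pair ? 2 : 1;
      count++;
    }
  return count;
}
```

For the input 'a\xf0\xa1\x88\xb4b\n', the non-BMP character U+21234 becomes the
pair 0xD844 0xDE34, so the buffer holds 5 units but only 4 characters, which
is the character count that wc -m is expected to report.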

> And if we decide that cygwin's extensions are sane, how much harder is
> it to characterize what a program must do to be portable to both 16-bit
> and 32-bit wchar_t if they are guaranteed the same behavior for all
> hosts of the same-size wchar_t?  In other words, would it really require
> that many #ifdefs in coreutils to portably and simultaneously support
> both sizes of wchar_t?

It would require
  1. changing the conversions that use mbrtowc to either convert an
     entire string at once (using mbsrtowcs), or make a second call to
     mbrtowc once the first call to mbrtowc has returned a high
     surrogate.
  2. changing all uses of <wctype.h> and wcwidth() to use different
     functions, either functions that take 2 wchar_t arguments, or
     functions that require the caller to combine the surrogates.

This means lots of logic that goes against the spirit of wchar_t
in ANSI C Amd. 1 and POSIX.

> > I'm more in favour of overriding wchar_t and all functions that depend on it -
> > like we did successfully for the socket functions.
> > 
> > In practice, this would mean that on Windows (both native Windows and
> > Cygwin >= 1.7) the use of a 'wchar_t' module will
> >   - override wchar_t to be 32 bits, like in glibc,
> >   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
> >     corresponding system functions are unusable, the replacements will use the
> >     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).
> ...
> compiler primitives, like L"xyz", which result in 16-bit wchar_t
> arrays, will be unusable

Good point. I agree, then, that overriding wchar_t is better not done.

> C1x will be adding compiler support for mandatory char16_t and char32_t
> types for UTF-16 and UTF-32 data, independently of whether wchar_t is
> 16-bit or 32-bit; maybe the better thing is to proactively start
> providing the new interfaces in <uchar.h> that will result from C1x
> adoption (and convert GNU programs to use this rather than wchar_t for
> character operations)
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

A newer draft is at
https://www.opengroup.org/platform/single_unix_specification/uploads/40/23495/n1548.pdf

This is a good point, but it would have two drawbacks:

  - It throws out the use of a POSIX API for a not-yet-standard API,

  - Performance: For the non-UTF-8 locales (ISO-8859-15, EUC-JP, and
    similar) on platforms like MacOS X, FreeBSD, Solaris, the 'wchar_t'
    representation is essentially a packed multibyte representation.
    This makes mbrtowc() fast, because it does not have to do a table
    lookup for the conversion from/to Unicode. If you use mbrtoc32
    instead of mbrtowc, you add extra runtime overhead for a conversion
    to Unicode that would not be necessary when using mbrtowc().

In other words, your proposal would solve the Windows wchar_t problem,
but at the price of a performance penalty on traditional Unix systems.

Here's a new proposal:
  - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
    on Windows platforms and to 'wchar_t' otherwise.
  - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
    Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
    'wcwidth' on most platforms, and a use of libunistring modules on
    Windows platforms.

With this proposal,

  - The code that uses <wctype.h> has to be changed, but in a trivial
    way that introduces no complicated logic: Just change 'w' to 'ww'.
    Not more difficult than, say, using strtoll() instead of strtol().

  - The runtime penalty on non-Windows systems is minimal.

  - On Windows platforms, surrogates are handled correctly, and
    code that uses wchar_t or <windows.h> is left alone.
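
A minimal sketch of what such a header could look like, under stated
assumptions: the names 'wwchar_t', 'mbrtowwc' and 'iswwalpha' come from the
proposal above, but the platform test and the exact signatures are only
guesses at this point. The Windows branch is left as declarations, because
its definitions would have to come from Unicode-aware replacement code (for
example the libunistring modules), which is not shown here. 'wwcwidth' would
follow the same pattern.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

#if defined _WIN32 || defined __CYGWIN__

/* 16-bit wchar_t platforms: use a type wide enough for any code point.  */
typedef uint32_t wwchar_t;

/* Declarations only (hypothetical signatures).  Real definitions would
   decode the multibyte input to a full code point and consult Unicode
   tables instead of calling the system's wchar_t functions.  */
size_t mbrtowwc (wwchar_t *pwwc, const char *s, size_t n, mbstate_t *ps);
int iswwalpha (wwchar_t wwc);

#else

/* Everywhere else wchar_t already holds one character, so the 'ww'
   functions are trivial redirections to the standard ones.  */
typedef wchar_t wwchar_t;

static inline size_t
mbrtowwc (wwchar_t *pwwc, const char *s, size_t n, mbstate_t *ps)
{
  return mbrtowc (pwwc, s, n, ps);
}

static inline int
iswwalpha (wwchar_t wwc)
{
  return iswalpha ((wint_t) wwc);
}

#endif
```

Client code would then replace 'wchar_t wc; mbrtowc (&wc, ...)' with
'wwchar_t wc; mbrtowwc (&wc, ...)' and 'iswalpha (wc)' with 'iswwalpha (wc)',
with no further #ifdefs.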

How does that sound? Comments?

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#7948; Package coreutils.

Message received at submit <at> debbugs.gnu.org:


Message-ID: <4D46EA2B.1010307@HIDDEN>
Date: Mon, 31 Jan 2011 09:58:19 -0700
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
	rv:1.9.2.13) Gecko/20101209 Fedora/3.1.7-0.35.b3pre.fc14
	Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.7
MIME-Version: 1.0
To: Bruno Haible <bruno@HIDDEN>
Subject: Re: 16-bit wchar_t on Windows and Cygwin
References: <201101310304.42975.bruno@HIDDEN>
In-Reply-To: <201101310304.42975.bruno@HIDDEN>
Cc: bug-coreutils <bug-coreutils@HIDDEN>, cygwin <cygwin@HIDDEN>,
	bug-gnulib@HIDDEN


[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At that point, the POSIX definition of those functions no
longer applies, and we can try to make the various wc* functions
behave as smartly as possible (as is the case with Cygwin); those
smarts are only needed when you use surrogate pairs.  If cygwin's
approach is correct, then maybe the thing to do is codify those smarts
for all implementations with 16-bit wchar_t as an extension to POSIX
that all gnulib clients can rely on, and thus minimize the #ifdefs in
such clients.

> What consequences does this have?
>
>   1) All code that uses the functions from <wctype.h> (wide character
>      classification and mapping) or wcwidth() malfunctions on strings that
>      contains Unicode characters outside the BMP, i.e. outside the range
>      U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

>   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
>      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>      output bytes when it consumes a wchar_t.

>   Now with a chinese character outside the BMP:
>   $
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
>
>   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
>
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
>   So both the number of characters and the number of words are counted
>   wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?
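
To make the question concrete, here is a rough sketch of the extra step a
client would need on a 16-bit wchar_t host: combine surrogate pairs into one
code point before counting or classifying. It works on plain uint16_t units so
that it does not depend on the host's wchar_t size; the function names are
made up for illustration and are not part of gnulib, coreutils, or cygwin.

```c
#include <stddef.h>
#include <stdint.h>

static int
is_high_surrogate (uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

static int
is_low_surrogate (uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Decode one code point from the N (>= 1) UTF-16 units at S into *CP.
   Return the number of units consumed: 2 for a valid surrogate pair,
   otherwise 1.  An unpaired surrogate is passed through unchanged.  */
static size_t
utf16_next (const uint16_t *s, size_t n, uint32_t *cp)
{
  if (n >= 2 && is_high_surrogate (s[0]) && is_low_surrogate (s[1]))
    {
      *cp = 0x10000 + (((uint32_t) (s[0] - 0xD800) << 10)
                       | (uint32_t) (s[1] - 0xDC00));
      return 2;
    }
  *cp = s[0];
  return 1;
}

/* Count characters (not 16-bit units) in the N units at S.  */
static size_t
utf16_count_chars (const uint16_t *s, size_t n)
{
  size_t count = 0;
  while (n > 0)
    {
      uint32_t cp;
      size_t used = utf16_next (s, n, &cp);
      s += used;
      n -= used;
      count++;
    }
  return count;
}
```

The bytes \xf0\xa1\x88\xb4 in the example above are U+21234, which is the
pair 0xD844 0xDE34 in UTF-16. Counting units for 'a', that pair, 'b' and
newline gives 5, while counting combined code points gives 4.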

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
>
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations). However, without compiler support for u"" and U""
(and even u8""), we are no better off than if we gave up compiler support
for L"" by forcing a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>
1 The header <uchar.h> declares types and functions for manipulating
  Unicode characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19);
      char16_t
  which is an unsigned integer type used for 16-bit characters and is
  the same type as uint_least16_t (described in 7.20.1.2); and
      char32_t
  which is an unsigned integer type used for 32-bit characters and is
  the same type as uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).
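
For comparison, a small sketch of what a 'wc -m' style loop might look like
on top of mbrtoc32(). It assumes a libc that already ships <uchar.h>, and
the helper name count_chars_c32 is invented for this example. The caller is
expected to have selected the desired locale with setlocale() beforehand;
without that, the "C" locale applies.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Count the characters in the N bytes at S, decoding with mbrtoc32() in
   the current locale.  Return (size_t) -1 if the input contains an
   invalid or truncated multibyte sequence.  */
static size_t
count_chars_c32 (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      char32_t c;
      size_t len = mbrtoc32 (&c, s, n, &state);

      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (len == (size_t) -3)
        continue;               /* extra char32_t, no input consumed */
      if (len == 0)
        len = 1;                /* a null character occupies one byte */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

Since char32_t always holds a full code point, this loop needs no surrogate
handling on any platform, at the cost of the conversion to Unicode.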

-- 
Eric Blake   eblake@HIDDEN    +1-801-349-2682
Libvirt virtualization library http://libvirt.org






Acknowledgement sent to Eric Blake <eblake@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN.
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#7948; Package coreutils.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.