GNU bug report logs - #32267
multibyte: dd: add lcase/ucase multibyte support

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Severity: wishlist; Reported by: Ralph Corderoy <ralph@HIDDEN>; dated Wed, 25 Jul 2018 08:12:02 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'multibyte: dd: add lcase/ucase multibyte support' from 'dd's ucase and lcase and LC_CTYPE.' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 32267 <at> debbugs.gnu.org:


Subject: Re: bug#32267: dd's ucase and lcase and LC_CTYPE.
To: Ralph Corderoy <ralph@HIDDEN>, 32267 <at> debbugs.gnu.org
References: <20180725081111.984911FBFB@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <e062523e-4c7b-2f8e-ad51-586b9c45706e@HIDDEN>
Date: Thu, 26 Jul 2018 02:21:35 -0700

Yes, this is a known issue with dd as with many other coreutils programs. 
Strictly speaking as I understand it, it is not a deviation from POSIX, since 
POSIX does not require support for locales with multibyte encodings. Still, it 
would be nice to fix dd at some point, although it'd be a pain to do correctly 
and efficiently and it's long been low priority since hardly anybody needs or 
uses this feature on any platform.




Information forwarded to bug-coreutils@HIDDEN:
bug#32267; Package coreutils. Full text available.

Message received at submit <at> debbugs.gnu.org:


To: bug-coreutils@HIDDEN
From: Ralph Corderoy <ralph@HIDDEN>
Subject: dd's ucase and lcase and LC_CTYPE.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Date: Wed, 25 Jul 2018 09:11:11 +0100
Message-Id: <20180725081111.984911FBFB@HIDDEN>

Hi,

Of dd(1), POSIX says

    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html
    lcase
        Map uppercase characters specified by the LC_CTYPE keyword
        tolower to the corresponding lowercase character.  Characters
        for which no mapping is specified shall not be modified by this
        conversion.

and similarly for `ucase'.

But dd in coreutils 8.29-1 on Arch Linux just has a simple 256-byte
translation table that's mapped through tolower(3) or toupper(3).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/tolower.html
describes tolower(3) as handling only `unsigned char' or EOF, and being
the identity function on all values where there isn't a lowercase letter
for the uppercase value.
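
A rough model of such a byte-wise table, written in Python for brevity. This is only a sketch: the names `build_ucase_table` and `ucase_bytes` are made up for illustration, and the table assumes a locale in which only the ASCII letters have a single-byte case mapping, which is what one would expect toupper(3) to give on individual bytes in a UTF-8 locale. It is not code taken from dd.

```python
def build_ucase_table() -> bytes:
    """Build a 256-entry table: identity, except ASCII a-z map to A-Z.

    Assumption: bytes >= 0x80 have no single-byte case mapping, as one
    would expect from toupper(3) on individual bytes in a UTF-8 locale.
    """
    table = bytearray(range(256))
    for b in range(ord("a"), ord("z") + 1):
        table[b] = b - 32
    return bytes(table)


UCASE = build_ucase_table()


def ucase_bytes(data: bytes) -> bytes:
    """Translate each byte independently, with no notion of characters."""
    return data.translate(UCASE)


# 'a' followed by U+023F (bytes C8 BF in UTF-8).
sample = "a\u023f".encode("utf-8")
result = ucase_bytes(sample)
print(result)  # the ASCII 'a' is upper-cased; the two-byte character is untouched
```

Because every byte is looked up on its own, a table like this can never change how many bytes a character occupies.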

This deviation isn't documented AFAICS.  It means ASCII and ISO-8859-1
are re-cased just fine.  UTF-8 has its ASCII subset altered, and other
bytes left alone, so the end result is valid UTF-8, but not fully
re-cased.  But charmaps like /usr/share/i18n/charmaps/CP949.gz,
https://en.wikipedia.org/wiki/Unified_Hangul_Code, have variable-length
byte sequences where 0x41, for example, isn't always an ASCII `A' and
thus shouldn't become 0x61, `a'.
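
To make the CP949 case concrete, here is a small Python sketch. It assumes Python's `cp949` codec and uses the two-byte sequence 0x81 0x41 as the example; the `LCASE` table is a hypothetical stand-in for a byte-wise lcase table, not dd's own.

```python
# Hypothetical byte-wise lcase table: ASCII A-Z map to a-z, all else identity.
LCASE = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in range(256))

# One CP949 character whose trail byte happens to be 0x41.
original = b"\x81\x41"
char = original.decode("cp949")

# Byte-wise translation rewrites the trail byte as if it were ASCII 'A'.
mangled = original.translate(LCASE)

print(len(char))                       # one character, two bytes
print(mangled)                         # trail byte is now 0x61
print(mangled.decode("cp949", errors="replace") == char)  # no longer the same text
```

The input contained no ASCII `A' at all, yet the byte-wise table still altered it, which is the corruption described above.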

Aside from improving the documentation, actually fixing dd to match
POSIX will need to handle the re-cased character being a different
number of bytes; particularly noticeable if the output file is the input
file with `conv=notrunc'.

    $ locale | grep LC_CTYPE
    LC_CTYPE="en_GB.utf8"
    $
    $ sed 'l; s/./\u&/; l' <<<ȿ
    \310\277$
    \342\261\276$
    Ȿ
    $ sed 'l; s/./\l&/; l' <<<Ȿ
    \342\261\276$
    \310\277$
    ȿ
    $
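
The same size change can be checked outside sed. The following Python sketch relies on Python's built-in Unicode case mapping rather than on LC_CTYPE, so it illustrates the byte counts only and says nothing about how dd would implement this.

```python
lower = "\u023f"          # the lowercase letter used in the sed example
upper = lower.upper()     # its uppercase counterpart, U+2C7E

lower_bytes = lower.encode("utf-8")
upper_bytes = upper.encode("utf-8")

# Two bytes in, three bytes out: re-casing is not length-preserving.
print(len(lower_bytes), len(upper_bytes))
print(upper.lower() == lower)  # the mapping round-trips
```

With the output file being the input file, a conversion that grows the data like this would write past the bytes it has just read.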

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy




Acknowledgement sent to Ralph Corderoy <ralph@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to bug-coreutils@HIDDEN:
bug#32267; Package coreutils. Full text available.
Last modified: Tue, 30 Oct 2018 04:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.