Received: (at 31185) by debbugs.gnu.org; 17 Apr 2018 20:45:54 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Apr 17 16:45:54 2018
Received: from localhost ([127.0.0.1]:58004 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1f8XU9-000429-Hp
	for submit <at> debbugs.gnu.org; Tue, 17 Apr 2018 16:45:53 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:33518)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <eggert@HIDDEN>) id 1f8XU7-0003uP-KT
	for 31185 <at> debbugs.gnu.org; Tue, 17 Apr 2018 16:45:52 -0400
Received: from localhost (localhost [127.0.0.1])
	by zimbra.cs.ucla.edu (Postfix) with ESMTP id B327F1615FB;
	Tue, 17 Apr 2018 13:45:45 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
	by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
	with ESMTP id n3FFr21POCVP; Tue, 17 Apr 2018 13:45:40 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
	by zimbra.cs.ucla.edu (Postfix) with ESMTP id E3958161611;
	Tue, 17 Apr 2018 13:45:40 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
	by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
	with ESMTP id R-OmRoN61aGZ; Tue, 17 Apr 2018 13:45:40 -0700 (PDT)
Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200])
	by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C8DD51615FB;
	Tue, 17 Apr 2018 13:45:40 -0700 (PDT)
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for Unicode?
To: Keepun <keepun@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
	<1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
	<b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <3cdfd123-75fa-2783-df8c-c0250236334b@HIDDEN>
Date: Tue, 17 Apr 2018 13:45:40 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 31185
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

On 04/17/2018 01:27 PM, Keepun wrote:
> why there are no plans to support UTF-16 and UTF-32?

Nobody has volunteered to do it, and there hasn't been a pressing need.
UTF-16 and UTF-32 are primarily used for internal representation, not for
text files. For more on the subject, please see:

http://utf8everywhere.org/

bug-diffutils@HIDDEN: bug#31185; Package diffutils.

Received: (at 31185) by debbugs.gnu.org; 17 Apr 2018 20:27:50 +0000
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for Unicode?
To: Paul Eggert <eggert@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
	<1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
From: Keepun <keepun@HIDDEN>
Message-ID: <b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
Date: Tue, 17 Apr 2018 23:27:36 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
In-Reply-To: <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>

UTF-8 does not require a BOM, but for UTF-16 and UTF-32 a BOM is always
present. Files in UTF-16 and UTF-32 without the BOM should be identified
as binary.

But why are there no plans to support UTF-16 and UTF-32? Diff is part of
Git and is used all over the world. It is now 2018, and Unicode has solved
the problems with encodings.

17.04.2018 10:37, Paul Eggert:
> Keepun wrote:
>> Files with encoding greater than 8 bits without BOM at the beginning
>> can be immediately identified as binary.
>
> No, the BOM is not required or recommended in UTF-8, so it would be a
> mistake to identify GNU/Linux text files as binary merely because they
> lack a BOM. Typically these files do not have a BOM, and when they do
> one of the first things many users do is remove the BOM because it can
> cause trouble in practice.
>
> Diffutils does not support UTF-16, where a BOM would make more sense,
> and there are no plans to add support for UTF-16 (or for UTF-32, for
> that matter).
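
A minimal sketch of what a 0x00 check, like the one in the GetEncodingStream function from the original report, does with BOM-less input. The class name NulByteCheck and the helper ContainsNul are hypothetical and introduced only for illustration. The sketch assumes mostly-ASCII text: UTF-16 and UTF-32 encode such text with zero bytes, so a NUL scan flags those files as binary even without any BOM test, while the same text in UTF-8 passes.

```csharp
using System;
using System.Text;

public static class NulByteCheck
{
    // Hypothetical helper: reports whether any byte in the buffer is 0x00.
    public static bool ContainsNul(byte[] data)
    {
        return Array.IndexOf(data, (byte)0) >= 0;
    }

    public static void Main()
    {
        string line = "diff\n";

        // Encoding.GetBytes encodes the characters only; it does not prepend a BOM.
        byte[] utf8 = Encoding.UTF8.GetBytes(line);
        byte[] utf16le = Encoding.Unicode.GetBytes(line);
        byte[] utf32le = Encoding.UTF32.GetBytes(line);

        Console.WriteLine("UTF-8 contains NUL:  " + ContainsNul(utf8));
        Console.WriteLine("UTF-16 contains NUL: " + ContainsNul(utf16le));
        Console.WriteLine("UTF-32 contains NUL: " + ContainsNul(utf32le));
    }
}
```

Text made up entirely of characters whose UTF-16 code units have no zero byte would slip past this scan, so the sketch illustrates the common case rather than a complete detector.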

Received: (at 31185) by debbugs.gnu.org; 17 Apr 2018 07:37:26 +0000
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for Unicode?
To: Keepun <keepun@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
Date: Tue, 17 Apr 2018 00:37:18 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
In-Reply-To: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>

Keepun wrote:
> Files with encoding greater than 8 bits without BOM at the beginning can be
> immediately identified as binary.

No, the BOM is not required or recommended in UTF-8, so it would be a
mistake to identify GNU/Linux text files as binary merely because they lack
a BOM. Typically these files do not have a BOM, and when they do one of the
first things many users do is remove the BOM because it can cause trouble in
practice.

Diffutils does not support UTF-16, where a BOM would make more sense, and
there are no plans to add support for UTF-16 (or for UTF-32, for that matter).
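
To illustrate the point about UTF-8, here is a minimal sketch showing that ordinary UTF-8 text carries no BOM yet can still be recognized as text by strict decoding. The class name Utf8WithoutBom and the helper IsValidUtf8 are hypothetical. This is one possible check under stated assumptions, not a description of what diffutils does.

```csharp
using System;
using System.Text;

public static class Utf8WithoutBom
{
    // Hypothetical helper: true when the bytes decode as well-formed UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes decoding throw on malformed sequences.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // True when the buffer starts with the UTF-8 BOM bytes EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    public static void Main()
    {
        // Typical text file content: non-ASCII characters, no BOM.
        byte[] text = Encoding.UTF8.GetBytes("na\u00EFve diff\n");

        Console.WriteLine("has BOM: " + HasUtf8Bom(text));
        Console.WriteLine("valid UTF-8: " + IsValidUtf8(text));
    }
}
```

A detector that required the BOM would reject this buffer, even though it is plain UTF-8 text.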

Received: (at submit) by debbugs.gnu.org; 16 Apr 2018 22:01:23 +0000
To: bug-diffutils@HIDDEN
From: Keepun <keepun@HIDDEN>
Subject: Why is there no full support for Unicode?
Message-ID: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
Date: Tue, 17 Apr 2018 00:38:10 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

Why is there no full support for Unicode?

Set the encoding using the BOM.

A file should be given binary status only after checking for 0x00
characters.

The BOM is part of the Unicode standard.
http://www.unicode.org/faq/utf_bom.html#bom4

Files with encoding greater than 8 bits without BOM at the beginning can be
immediately identified as binary.

My function in C#:

using System.IO;
using System.Text;

/// <summary>
/// Detects the encoding of a seekable stream from its BOM.
/// </summary>
/// <param name="stream"></param>
/// <returns>null - binary</returns>
public static Encoding GetEncodingStream(Stream stream)
{
    BinaryReader bin = new BinaryReader(stream);
    byte[] bom = new byte[4];
    bin.BaseStream.Seek(0, SeekOrigin.Begin);
    bin.BaseStream.Read(bom, 0, bom.Length);
    bin.BaseStream.Seek(0, SeekOrigin.Begin);
    if (bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF) {
        return new UTF32Encoding(true, true); // UTF-32, big-endian
    } else if (bom[0] == 0xFE && bom[1] == 0xFF) {
        return new UnicodeEncoding(true, true); // UTF-16, big-endian
    } else if (bom[0] == 0xFF && bom[1] == 0xFE) {
        // FF FE 00 00 is the UTF-32 little-endian BOM, so both trailing
        // bytes must be checked (the original tested bom[2] twice).
        if (bom[2] == 0x00 && bom[3] == 0x00) {
            return new UTF32Encoding(false, true); // UTF-32, little-endian
        } else {
            return new UnicodeEncoding(false, true); // UTF-16, little-endian
        }
    } else if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
        return new UTF8Encoding(true);
    } else {
        // No BOM: scan up to the first 100000 bytes for a 0x00 byte.
        bool binary = false;
        long fsize = bin.BaseStream.Length;
        if (fsize > 100000) {
            fsize = 100000;
        }
        byte[] bts = new byte[fsize];
        bin.BaseStream.Seek(0, SeekOrigin.Begin);
        // Read may return fewer bytes than requested; scan only what was
        // read, so that unfilled (zero) slots are not mistaken for NULs.
        int read = bin.BaseStream.Read(bts, 0, (int)fsize);
        bin.BaseStream.Seek(0, SeekOrigin.Begin);
        for (int x = 0; x < read; x++) {
            if (bts[x] == 0) {
                binary = true;
                break;
            }
        }
        if (binary) {
            return null;
        }

        return Encoding.Default;
    }
}

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.