GNU bug report logs - #31185
Why is there no full support for Unicode?

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: diffutils; Reported by: Keepun <keepun@HIDDEN>; dated Mon, 16 Apr 2018 22:02:01 UTC; Maintainer for diffutils is bug-diffutils@HIDDEN.

Message received at 31185 <at> debbugs.gnu.org:


Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
To: Keepun <keepun@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
 <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
 <b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <3cdfd123-75fa-2783-df8c-c0250236334b@HIDDEN>
Date: Tue, 17 Apr 2018 13:45:40 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US

On 04/17/2018 01:27 PM, Keepun wrote:
> why are there no plans to support UTF-16 and UTF-32?

Nobody has volunteered to do it, and there hasn't been a pressing need. 
UTF-16 and UTF-32 are primarily used for internal representation, not 
for text files. For more on the subject, please see:

http://utf8everywhere.org/





Information forwarded to bug-diffutils@HIDDEN:
bug#31185; Package diffutils. Full text available.

Message received at 31185 <at> debbugs.gnu.org:


Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
To: Paul Eggert <eggert@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
 <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
From: Keepun <keepun@HIDDEN>
Message-ID: <b9cfe89a-fbf9-72d1-be65-06f5b1eba227@HIDDEN>
Date: Tue, 17 Apr 2018 23:27:36 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
Content-Type: multipart/alternative;
 boundary="------------2C700FED09D584661B5FE60C"

This is a multi-part message in MIME format.
--------------2C700FED09D584661B5FE60C
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

UTF-8 does not require a BOM, but for UTF-16 and UTF-32 a BOM is always 
present. UTF-16 and UTF-32 files without a BOM should be 
identified as binary.

But why are there no plans to support UTF-16 and UTF-32? Diff is part of 
Git and is used all over the world. It is now 2018, and Unicode has solved 
the problems with encodings.


17.04.2018 10:37, Paul Eggert:
> Keepun wrote:
>> Files with encoding greater than 8 bits without BOM at the beginning 
>> can be immediately identified as binary.
>
> No, the BOM is not required or recommended in UTF-8, so it would be a 
> mistake to identify GNU/Linux text files as binary merely because they 
> lack a BOM. Typically these files do not have a BOM, and when they do 
> one of the first things many users do is remove the BOM because it can 
> cause trouble in practice.
>
> Diffutils does not support UTF-16, where a BOM would make more sense, 
> and there are no plans to add support for UTF-16 (or for UTF-32, for 
> that matter).


--------------2C700FED09D584661B5FE60C--




Information forwarded to bug-diffutils@HIDDEN:
bug#31185; Package diffutils. Full text available.

Message received at 31185 <at> debbugs.gnu.org:


Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
To: Keepun <keepun@HIDDEN>, 31185 <at> debbugs.gnu.org
References: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <1e6ea3ef-86af-a130-30b2-df5e74207668@HIDDEN>
Date: Tue, 17 Apr 2018 00:37:18 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit

Keepun wrote:
> Files with encoding greater than 8 bits without BOM at the beginning can be 
> immediately identified as binary.

No, the BOM is not required or recommended in UTF-8, so it would be a mistake to 
identify GNU/Linux text files as binary merely because they lack a BOM. 
Typically these files do not have a BOM, and when they do one of the first 
things many users do is remove the BOM because it can cause trouble in practice.

Diffutils does not support UTF-16, where a BOM would make more sense, and there 
are no plans to add support for UTF-16 (or for UTF-32, for that matter).
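
To illustrate the point about BOMs and byte-oriented tools, here is a minimal C# sketch. It assumes a NUL-byte check like the one proposed in the original report; the class and helper names are hypothetical and this is not diffutils code. It shows that BOM-less UTF-8 text contains no 0x00 bytes, while the same ASCII text encoded as UTF-16 does, so a NUL check alone would classify UTF-16 text as binary.

```csharp
using System;
using System.Text;

public static class NulByteDemo
{
    // Hypothetical helper: true if the buffer contains a 0x00 byte,
    // the kind of check a byte-oriented tool might use to flag binary data.
    public static bool ContainsNul(byte[] data)
    {
        return Array.IndexOf(data, (byte)0) >= 0;
    }

    public static void Main()
    {
        string text = "diff -u old new\n";

        // UTF-8 without a BOM: ASCII characters map to single non-zero bytes.
        byte[] utf8 = new UTF8Encoding(false).GetBytes(text);

        // UTF-16LE without a BOM: every ASCII character gets a 0x00 high byte.
        byte[] utf16 = new UnicodeEncoding(false, false).GetBytes(text);

        Console.WriteLine(ContainsNul(utf8));   // prints "False"
        Console.WriteLine(ContainsNul(utf16));  // prints "True"

        // The optional UTF-8 BOM is the three bytes EF BB BF.
        Console.WriteLine(new UTF8Encoding(true).GetPreamble().Length); // prints "3"
    }
}
```

The sketch only demonstrates the byte-level difference between the encodings; it says nothing about how any particular diff implementation decides what is binary.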




Information forwarded to bug-diffutils@HIDDEN:
bug#31185; Package diffutils. Full text available.

Message received at submit <at> debbugs.gnu.org:


To: bug-diffutils@HIDDEN
From: Keepun <keepun@HIDDEN>
Subject: Why is there no full support for Unicode?
Message-ID: <7f6138f2-dec1-034d-7414-7c9749315291@HIDDEN>
Date: Tue, 17 Apr 2018 00:38:10 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="------------1271067F6861B3321FDA72A8"

This is a multi-part message in MIME format.
--------------1271067F6861B3321FDA72A8
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Why is there no full support for Unicode?

Determine the encoding from the BOM.

A file should be classified as binary only after checking it for 0x00 
bytes.

BOM is part of the Unicode standard. 
http://www.unicode.org/faq/utf_bom.html#bom4

Files with encoding greater than 8 bits without BOM at the beginning can 
be immediately identified as binary.

My function in C#:

// Requires: using System.IO; using System.Text;

/// <summary>
/// Guesses a stream's text encoding from its byte order mark (BOM).
/// </summary>
/// <param name="stream">A seekable stream; its position is reset to 0.</param>
/// <returns>The detected encoding, or null if the data looks binary.</returns>
public static Encoding GetEncodingStream(Stream stream)
{
     BinaryReader bin = new BinaryReader(stream);
     byte[] bom = new byte[4];
     bin.BaseStream.Seek(0, SeekOrigin.Begin);
     // Read may return fewer than 4 bytes for short files.
     int n = bin.BaseStream.Read(bom, 0, bom.Length);
     bin.BaseStream.Seek(0, SeekOrigin.Begin);
     if (n >= 4 && bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE
         && bom[3] == 0xFF) {
         return new UTF32Encoding(true, true); // UTF-32, big-endian
     } else if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF) {
         return new UnicodeEncoding(true, true); // UTF-16, big-endian
     } else if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE) {
         // FF FE 00 00 is the UTF-32LE BOM; FF FE alone is UTF-16LE.
         if (n >= 4 && bom[2] == 0x00 && bom[3] == 0x00) {
             return new UTF32Encoding(false, true); // UTF-32, little-endian
         } else {
             return new UnicodeEncoding(false, true); // UTF-16, little-endian
         }
     } else if (n >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
         return new UTF8Encoding(true);
     } else {
         // No BOM: treat the data as binary if the first 100000 bytes
         // contain a 0x00 byte.
         bool binary = false;
         long fsize = bin.BaseStream.Length;
         if (fsize > 100000) {
             fsize = 100000;
         }
         byte[] bts = new byte[fsize];
         bin.BaseStream.Seek(0, SeekOrigin.Begin);
         // Scan only the bytes actually read, not the zero-filled remainder.
         int got = bin.BaseStream.Read(bts, 0, (int)fsize);
         bin.BaseStream.Seek(0, SeekOrigin.Begin);
         for (int x = 0; x < got; x++) {
             if (bts[x] == 0) {
                 binary = true;
                 break;
             }
         }
         if (binary) {
             return null;
         }

         return Encoding.Default;
     }
}
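
For comparison with the hand-written BOM check above, .NET's `StreamReader` can detect a BOM itself. The following is a rough sketch, not part of the original report: the class and method names are hypothetical, and it assumes that input without a BOM is UTF-8. It decodes a stream according to its BOM and re-encodes the text as BOM-less UTF-8, which is one way such files could be normalized before being handed to a byte-oriented tool.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomTranscodeSketch
{
    // Hypothetical helper: decode the stream, honoring a BOM if one is
    // present (assumption: BOM-less input is UTF-8), and return the text
    // re-encoded as UTF-8 without a BOM.
    public static byte[] ToUtf8NoBom(Stream input)
    {
        using (var reader = new StreamReader(input, new UTF8Encoding(false),
                                             detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            return new UTF8Encoding(false).GetBytes(text);
        }
    }

    public static void Main()
    {
        // Build a small UTF-16LE sample in memory, preceded by its BOM (FF FE).
        var utf16 = new UnicodeEncoding(false, true);
        byte[] bom = utf16.GetPreamble();
        byte[] body = utf16.GetBytes("h\u00E9llo\n");

        var sample = new MemoryStream();
        sample.Write(bom, 0, bom.Length);
        sample.Write(body, 0, body.Length);
        sample.Position = 0;

        byte[] utf8 = ToUtf8NoBom(sample);
        Console.WriteLine(BitConverter.ToString(utf8)); // prints "68-C3-A9-6C-6C-6F-0A"
    }
}
```

Unlike the function above, this sketch never reports "binary"; it always decodes, so a NUL-byte check would still be needed as a separate step.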


--------------1271067F6861B3321FDA72A8--




Acknowledgement sent to Keepun <keepun@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-diffutils@HIDDEN. Full text available.
Report forwarded to bug-diffutils@HIDDEN:
bug#31185; Package diffutils. Full text available.
Last modified: Tue, 17 Apr 2018 21:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.