GNU bug report logs - #28255
grep erroneously skips Microsoft UTF-8 text files as being binary

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Reported by: Simon <ixlr82c@HIDDEN>; Done: Paul Eggert <eggert@HIDDEN>; Maintainer for grep is bug-grep@HIDDEN.
Bug closed; send any further explanations to 28255 <at> debbugs.gnu.org and Simon <ixlr82c@HIDDEN>. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org.

Message received at 28255 <at> debbugs.gnu.org:


Received: (at 28255) by debbugs.gnu.org; 28 Aug 2017 00:18:58 +0000
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as
 being binary
To: Simon <ixlr82c@HIDDEN>
References: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@HIDDEN>
 <80b5a5bd-7b47-74b8-01b4-b681d8cc12ee@HIDDEN>
 <148439f0-7616-e9bd-9ccd-fe114e6ab602@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <ba7ffd59-c820-8690-4a44-d77c10481446@HIDDEN>
Date: Sun, 27 Aug 2017 17:18:47 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <148439f0-7616-e9bd-9ccd-fe114e6ab602@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 28255
Cc: 28255 <at> debbugs.gnu.org

Simon wrote:
> Sorry, my description was slightly ambiguous.  Rather than "skips", I
> should have said that grep treats the file as binary and does not find
> a match, because each character takes 2 octets as per utf-8.
> 
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000  ff fe 74 00 65 00 73 00  74 00 31 00 0d 00 0a 00  |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000  74 65 73 74 32 0d 0a                              |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
> 
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems).  As you can see,
> "1.txt" is skipped because its characters are encoded as two octets per
> character.
> 
> To show that "1.txt" is a valid Windows text file: if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect the BOM at
> the beginning, switch to UTF-8 encoding, and preserve it upon saving.
> 
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files.  (I can only confirm Win 7 or higher.)
> 
> I guess this should really be considered a feature, not a bug.
> 
> The same happens with Cygwin grep running under Windows.

You're right. grep and most other GNU tools do not support UTF-16. You can use 
the 'recode' command to convert to UTF-8, which grep does support.
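
As a sketch of that conversion step: where recode is not installed, iconv can do the same job. The example below assumes iconv is available and that the file starts with a BOM, so that the generic UTF-16 source encoding can pick the byte order; grep_utf16 is a hypothetical helper name, not a grep feature.

```shell
# Hypothetical helper: search a UTF-16 file by converting it to UTF-8 on the fly.
# Assumes iconv is installed and the file begins with a byte order mark.
grep_utf16() {
    pattern=$1
    file=$2
    # Convert to UTF-8 on stdout, then let grep search the converted text.
    iconv -f UTF-16 -t UTF-8 "$file" | grep -e "$pattern"
}
```

With the 1.txt from the report, `grep_utf16 test 1.txt` should then find the line; the matched line still ends in a carriage return because the file uses CRLF line endings.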




Information forwarded to bug-grep@HIDDEN:
bug#28255; Package grep. Full text available.

Message received at 28255 <at> debbugs.gnu.org:


Received: (at 28255) by debbugs.gnu.org; 27 Aug 2017 21:47:37 +0000
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as
 being binary
To: Simon <ixlr82c@HIDDEN>, 28255 <at> debbugs.gnu.org
References: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <80b5a5bd-7b47-74b8-01b4-b681d8cc12ee@HIDDEN>
Date: Sun, 27 Aug 2017 14:47:28 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 28255

Simon wrote:
> Windows text files can start with a byte order mark of U+FEFF and then
> be encoded in UTF-8.  These are skipped as being binary files.

I can't reproduce this problem on Fedora 26 x86-64. Here's how I tried:

$ printf '\357\273\277x\n' >t
$ LC_ALL=C grep x t | od -c
0000000 357 273 277   x  \n
0000005

To help us diagnose the problem, please send a simple, self-contained example, 
and mention your platform.
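
One way to narrow this down before sending an example is to check which byte order mark the file actually starts with. A minimal sketch, assuming a POSIX od; the bom_kind function name is hypothetical and not part of grep:

```shell
# Hypothetical helper: report which byte order mark, if any, a file starts with.
bom_kind() {
    # Read the first three bytes as hex digits, e.g. "efbbbf" or "fffe74".
    head=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    case $head in
        efbbbf) echo "UTF-8 BOM" ;;
        fffe*)  echo "UTF-16LE BOM" ;;   # could also be UTF-32LE if followed by 00 00
        feff*)  echo "UTF-16BE BOM" ;;
        *)      echo "no BOM" ;;
    esac
}
```

For the file t created above, `bom_kind t` reports a UTF-8 BOM. A file that begins with the bytes ff fe is UTF-16 rather than UTF-8.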




Information forwarded to bug-grep@HIDDEN:
bug#28255; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 27 Aug 2017 21:23:52 +0000
Received: from 135-23-152-105.cpe.pppoe.ca (HELO [192.168.1.148])
 ([135.23.152.105])
 by smtp.teksavvy.com with ESMTP/TLS/DHE-RSA-AES128-SHA;
 27 Aug 2017 17:13:35 -0400
To: bug-grep@HIDDEN
From: Simon <ixlr82c@HIDDEN>
Subject: grep erroneously skips Microsoft UTF-8 text files as being binary
Message-ID: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@HIDDEN>
Date: Sun, 27 Aug 2017 17:13:34 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Sun, 27 Aug 2017 17:23:51 -0400

Windows text files can start with a byte order mark of U+FEFF and then
be encoded in UTF-8.  These are skipped as being binary files.





Acknowledgement sent to Simon <ixlr82c@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#28255; Package grep. Full text available.
Last modified: Tue, 31 Dec 2019 20:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.