GNU bug report logs - #28255
grep erroneously skips Microsoft UTF-8 text files as being binary

Package: grep

Reported by: Simon <ixlr82c <at> teksavvy.com>

Date: Sun, 27 Aug 2017 21:24:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 28255 in the body.
You can then email your comments to 28255 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Sun, 27 Aug 2017 21:24:02 GMT)

Acknowledgement sent to Simon <ixlr82c <at> teksavvy.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 27 Aug 2017 21:24:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Simon <ixlr82c <at> teksavvy.com>
To: bug-grep <at> gnu.org
Subject: grep erroneously skips Microsoft UTF-8 text files as being binary
Date: Sun, 27 Aug 2017 17:13:34 -0400

Windows text files can start with a byte order mark of U+FEFF and then
be encoded in UTF-8.  These are skipped as being binary files.

Information forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Sun, 27 Aug 2017 21:48:02 GMT)

Message #8 received at 28255 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Simon <ixlr82c <at> teksavvy.com>, 28255 <at> debbugs.gnu.org
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as
 being binary
Date: Sun, 27 Aug 2017 14:47:28 -0700

Simon wrote:
> Windows text files can start with a byte order mark of U+FEFF and then
> be encoded in UTF-8.  These are skipped as being binary files.

I can't reproduce this problem on Fedora 26 x86-64. Here's how I tried:

$ printf '\357\273\277x\n' >t
$ LC_ALL=C grep x t | od -c
0000000 357 273 277   x  \n
0000005

To help us diagnose the problem, please send a simple, self-contained example, 
and mention your platform.
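
[Editorial sketch, not part of the original message.] When preparing such an example, it helps to check which byte order mark a file actually starts with, since a UTF-8 BOM (`ef bb bf`) and a UTF-16 BOM (`ff fe` or `fe ff`) lead to very different behavior. The helper name `bom_kind` below is hypothetical, and the sketch assumes a POSIX shell with `od` and `tr` available. It inspects only the first three bytes, so it cannot tell UTF-16LE from UTF-32LE.

```shell
# bom_kind is a hypothetical helper name; it is not part of grep or coreutils.
bom_kind() {
  # Dump the first three bytes as hex and strip all whitespace.
  bytes=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
  case $bytes in
    efbbbf) echo 'UTF-8 BOM' ;;
    fffe*)  echo 'UTF-16LE BOM' ;;   # could also be UTF-32LE; not checked here
    feff*)  echo 'UTF-16BE BOM' ;;
    *)      echo 'no BOM' ;;
  esac
}

# Try it on two small sample files in a scratch directory.
tmp=$(mktemp -d)
printf '\357\273\277x\n' >"$tmp/utf8.txt"
printf '\377\376t\000' >"$tmp/utf16.txt"
bom_kind "$tmp/utf8.txt"
bom_kind "$tmp/utf16.txt"
```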

Information forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Mon, 28 Aug 2017 00:19:02 GMT)

Message #11 received at 28255 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Simon <ixlr82c <at> teksavvy.com>
Cc: 28255 <at> debbugs.gnu.org
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as
 being binary
Date: Sun, 27 Aug 2017 17:18:47 -0700

Simon wrote:
> Sorry my description was slightly ambiguous.  I should not have said
> skip so much as treats the file as binary and does not find a match
> because each character takes 2 octets as per utf-8.
> 
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000  ff fe 74 00 65 00 73 00  74 00 31 00 0d 00 0a 00  |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000  74 65 73 74 32 0d 0a                              |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
> 
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems).  As you can see,
> the "1.txt" is skipped because the characters are encoded two octets per
> byte.
> 
> As an example that "1.txt" is a valid Windows text file, if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
> beginning and switch to UTF-8 encoding, and preserve it upon saving.
> 
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files.  (I can only confirm Win 7 or higher.)
> 
> I guess this should really be considered a feature, not a bug.
> 
> Similar happens for Cygwin grep running under windows.

You're right. grep and most other GNU tools do not support UTF-16. You can use 
the 'recode' command to convert to UTF-8, which grep does support.
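
[Editorial sketch, not part of the original message.] The conversion suggested above can also be done on the fly, without rewriting the file. The sketch below uses `iconv` instead of `recode`. That swap is an assumption on our part, as is `iconv` being installed with a `UTF-16` decoder that reads the BOM to choose the byte order. It recreates the reporter's `1.txt` in a scratch directory, converts it to UTF-8, strips the carriage return, and passes the result to grep.

```shell
# Work in a scratch directory so no existing files are overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Recreate the reporter's file: BOM ff fe, then "test1\r\n" in UTF-16LE.
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt

# Convert to UTF-8 in a pipeline, drop the CR, then search the converted text.
result=$(iconv -f UTF-16 -t UTF-8 1.txt | tr -d '\r' | grep test)
printf '%s\n' "$result"
```

Because the conversion happens in a pipeline, the original UTF-16 file is left unchanged.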

bug closed, send any further explanations to 28255 <at> debbugs.gnu.org and Simon <ixlr82c <at> teksavvy.com> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 31 Dec 2019 19:48:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 29 Jan 2020 12:24:06 GMT)

This bug report was last modified 5 years and 170 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.