GNU bug report logs -
#28255
grep erroneously skips Microsoft UTF-8 text files as being binary
Reported by: Simon <ixlr82c <at> teksavvy.com>
Date: Sun, 27 Aug 2017 21:24:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 28255 in the body.
You can then email your comments to 28255 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Sun, 27 Aug 2017 21:24:02 GMT)
Acknowledgement sent to Simon <ixlr82c <at> teksavvy.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 27 Aug 2017 21:24:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Windows text files can start with a byte order mark of U+FEFF and then
be encoded in UTF-8. These are skipped as being binary files.
Information forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Sun, 27 Aug 2017 21:48:02 GMT)
Message #8 received at 28255 <at> debbugs.gnu.org (full text, mbox):
Simon wrote:
> Windows text files can start with a byte order mark of U+FEFF and then
> be encoded in UTF-8. These are skipped as being binary files.
I can't reproduce this problem on Fedora 26 x86-64. Here's how I tried:
$ printf '\357\273\277x\n' >t
$ LC_ALL=C grep x t | od -c
0000000 357 273 277 x \n
0000005
To help us diagnose the problem, please send a simple, self-contained example,
and mention your platform.
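One way to prepare such an example is to dump the first bytes of the file and see which byte order mark it carries: a UTF-8 mark is the three octets ef bb bf, while a UTF-16 mark is ff fe (little-endian) or fe ff (big-endian). The following is only a sketch, assuming head -c, od and tr are available; the variable names sample, bom and kind are hypothetical.

```shell
# Create the same UTF-8-with-BOM sample as in the test above, in a temp file.
sample=$(mktemp)
printf '\357\273\277x\n' >"$sample"

# Dump the first three bytes as hex and squeeze out spaces and newlines.
bom=$(head -c 3 "$sample" | od -An -tx1 | tr -d ' \n')

# Classify by the leading bytes. This rough check does not try to
# tell UTF-32 apart from UTF-16.
case $bom in
  efbbbf) kind="UTF-8 with BOM" ;;
  fffe*)  kind="UTF-16LE with BOM" ;;
  feff*)  kind="UTF-16BE with BOM" ;;
  *)      kind="no BOM" ;;
esac
echo "$kind"
rm -f "$sample"
```

Including that hex prefix in a report makes it clear which encoding the file actually uses.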
Information forwarded to bug-grep <at> gnu.org:
bug#28255; Package grep. (Mon, 28 Aug 2017 00:19:02 GMT)
Message #11 received at 28255 <at> debbugs.gnu.org (full text, mbox):
Simon wrote:
> Sorry my description was slightly ambiguous. I should not have said
> skip so much as treats the file as binary and does not find a match
> because each character takes 2 octets as per utf-8.
>
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000 ff fe 74 00 65 00 73 00 74 00 31 00 0d 00 0a 00 |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000 74 65 73 74 32 0d 0a |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
>
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems). As you can see,
> "1.txt" is skipped because the characters are encoded two octets per
> character.
>
> As an example that "1.txt" is a valid Windows text file, if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
> beginning and switch to UTF-8 encoding, and preserve it upon saving.
>
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files. (I can only confirm Win 7 or higher.)
>
> I guess this should really be considered a feature, not a bug.
>
> The same happens with Cygwin grep running under Windows.
You're right. grep and most other GNU tools do not support UTF-16. You can use
the 'recode' command to convert to UTF-8, which grep does support.
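As an illustration of the conversion step, here is a minimal sketch that recreates the reporter's 1.txt and converts it to UTF-8 before searching. It uses iconv rather than recode (a swapped-in tool, assumed to be installed) and strips the carriage return with tr; the names workdir and matches are hypothetical.

```shell
# Recreate the reporter's UTF-16LE file: BOM ff fe, then "test1\r\n".
workdir=$(mktemp -d)
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >"$workdir/1.txt"

# Convert to UTF-8 before searching. iconv stands in for recode here;
# the plain "UTF-16" source name relies on iconv reading the byte order
# from the BOM. tr removes the Windows carriage return.
matches=$(iconv -f UTF-16 -t UTF-8 "$workdir/1.txt" | tr -d '\r' | grep test)
printf '%s\n' "$matches"
rm -rf "$workdir"
```

With the file converted this way, grep sees ordinary single-byte ASCII characters and should report the test1 line.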
bug closed, send any further explanations to 28255 <at> debbugs.gnu.org and Simon <ixlr82c <at> teksavvy.com>
Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 31 Dec 2019 19:48:02 GMT)
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 29 Jan 2020 12:24:06 GMT)
This bug report was last modified 4 years and 86 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.