GNU bug report logs - #78276
grep on file with 0xF3 byte in utf-8 locale
To reply to this bug, email your comments to 78276 AT debbugs.gnu.org.
There is no need to reopen the bug first.
Report forwarded to bug-grep <at> gnu.org: bug#78276; Package grep. (Tue, 06 May 2025 07:39:02 GMT)
Acknowledgement sent to Arkadiusz Miśkiewicz <arekm <at> maven.pl>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 06 May 2025 07:39:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi.
I was trying to grep mail logs for some entries, and a spammer used a
0xF3 byte to try to hide or obfuscate things. For grep it looks like this:
$ printf 'a\xF3bcdefgh' > x2
$ LC_ALL=C.UTF-8 grep 'a.*h' x2
$
$ LC_ALL=C grep 'a.*h' x2
abcdefgh
$ LC_ALL=C.UTF-8 grep -a 'a.*h' x2
$
$ LC_ALL=C grep -a 'a.*h' x2
abcdefgh
Is that expected behavior, no binary file warning and no matching with
utf-8 locale, even with -a? AFAIK that's not a correct UTF-8 sequence.
$ grep --version x2
grep (GNU grep) 3.12
Copyright (C) 2025 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.
grep -P uses PCRE2 10.45 2025-02-05
--
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Tue, 06 May 2025 09:13:01 GMT)
Notification sent to Arkadiusz Miśkiewicz <arekm <at> maven.pl>: bug acknowledged by developer. (Tue, 06 May 2025 09:13:02 GMT)
Message #10 received at 78276-done <at> debbugs.gnu.org (full text, mbox):
On 2025-05-06 00:37, Arkadiusz Miśkiewicz via Bug reports for GNU grep
wrote:
> Is that expected behavior, no binary file warning and no matching with
> utf-8 locale, even with -a?
It's allowed behavior, as '.' need not match encoding errors.[1] Also,
'grep' need not diagnose encoding errors that don't harm the output.[2]
As you mentioned in your email, using LC_ALL=C should let '.' match any
byte, so that should let you do what you want.
[1]: https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html
[2]: https://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html
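The LC_ALL=C workaround from the reply can be sketched as a small script. This is only an illustration, assuming a POSIX shell, mktemp, and GNU grep. The temporary directory and the variable names (tmpdir, sample, match_count) are made up for this sketch and do not come from the report. The 0xF3 byte is written as octal \363 because printf's \x escapes are not portable. In UTF-8, 0xF3 introduces a four-byte sequence and must be followed by continuation bytes in the range 0x80-0xBF, so 0xF3 followed by 'b' is an encoding error. In the C locale every byte counts as one character, so '.' can match it.

```shell
# Work in a throwaway directory (hypothetical names, not from the report).
tmpdir=$(mktemp -d)
sample="$tmpdir/x2"

# Octal \363 is the 0xF3 byte from the report; a trailing newline is
# added here so the file ends with a complete line.
printf 'a\363bcdefgh\n' > "$sample"

# In the C locale each byte is a single character, so '.' matches 0xF3.
# -c prints the number of matching lines rather than the raw bytes.
match_count=$(LC_ALL=C grep -c 'a.*h' "$sample")
echo "matching lines in C locale: $match_count"

# Clean up the temporary directory.
rm -rf "$tmpdir"
```

With the single-line sample above, the script should report one matching line. Using -c keeps the non-UTF-8 byte out of the terminal; drop it to see the matching line itself, as in the original session.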
This bug report was last modified 8 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.