GNU bug report logs - #78276
grep on file with 0xF3 byte in utf-8 locale


Package: grep;

Reported by: Arkadiusz Miśkiewicz <arekm <at> maven.pl>

Date: Tue, 6 May 2025 07:39:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>


Report forwarded to bug-grep <at> gnu.org:
bug#78276; Package grep. (Tue, 06 May 2025 07:39:02 GMT)

Acknowledgement sent to Arkadiusz Miśkiewicz <arekm <at> maven.pl>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 06 May 2025 07:39:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Arkadiusz Miśkiewicz <arekm <at> maven.pl>
To: bug-grep <at> gnu.org
Subject: grep on file with 0xF3 byte in utf-8 locale
Date: Tue, 6 May 2025 09:37:36 +0200

Hi.

I was trying to grep logs for some mail log entries, and a spammer had used
a 0xF3 byte to try to hide / trick things. For grep it looks like this:

$ printf 'a\xF3bcdefgh' > x2

$ LC_ALL=C.UTF-8 grep 'a.*h' x2
$

$ LC_ALL=C grep 'a.*h' x2
abcdefgh

$ LC_ALL=C.UTF-8 grep -a 'a.*h' x2
$

$ LC_ALL=C grep -a 'a.*h' x2
abcdefgh


Is that expected behavior, no binary file warning and no matching with 
utf-8 locale, even with -a? AFAIK that's not a correct UTF-8 sequence.
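
The claim about the byte sequence can be checked directly. The sketch below is not part of the original report: it recreates the file using `\363`, the octal spelling of 0xF3 that a POSIX `printf` accepts, dumps the bytes with `od`, and asks `iconv` (assumed to be installed) whether the file decodes as UTF-8. The helper name `check_utf8` is made up for illustration. In UTF-8, 0xF3 is the lead byte of a four-byte sequence and must be followed by three continuation bytes in the range 0x80-0xBF, so 0xF3 followed by `b` (0x62) is an encoding error.

```shell
# Recreate the reporter's test file; \363 is octal for byte 0xF3.
printf 'a\363bcdefgh' > x2

# Dump the raw bytes in hex so the stray 0xF3 is visible.
od -An -tx1 x2

# check_utf8 is a hypothetical helper: it reports whether iconv can
# decode the file as UTF-8. iconv exits non-zero on an invalid sequence
# (a missing iconv would also be reported as "invalid UTF-8").
check_utf8() {
  if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
    echo "valid UTF-8"
  else
    echo "invalid UTF-8"
  fi
}

check_utf8 x2
```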


$ grep --version x2
grep (GNU grep) 3.12
Copyright (C) 2025 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.

grep -P uses PCRE2 10.45 2025-02-05
-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )





Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Tue, 06 May 2025 09:13:01 GMT)

Notification sent to Arkadiusz Miśkiewicz <arekm <at> maven.pl>:
bug acknowledged by developer. (Tue, 06 May 2025 09:13:02 GMT)

Message #10 received at 78276-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Arkadiusz Miśkiewicz <arekm <at> maven.pl>
Cc: 78276-done <at> debbugs.gnu.org
Subject: Re: bug#78276: grep on file with 0xF3 byte in utf-8 locale
Date: Tue, 6 May 2025 02:12:25 -0700

On 2025-05-06 00:37, Arkadiusz Miśkiewicz via Bug reports for GNU grep wrote:
> Is that expected behavior, no binary file warning and no matching with 
> utf-8 locale, even with -a?

It's allowed behavior, as '.' need not match encoding errors.[1] Also, 
'grep' need not diagnose encoding errors that don't harm the output.[2]

As you mentioned in your email, using LC_ALL=C should let '.' match any 
byte, so that should let you do what you want.
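
A minimal sketch of that workaround, assuming the same test file as in the report (recreated here with `\363`, the octal spelling of 0xF3). Setting `LC_ALL=C` on the command itself, rather than exporting it, leaves the rest of the shell session in its usual locale. The `-c` option and the pipe through `od` are additions for illustration, so that the stray byte is not written raw to the terminal.

```shell
# Recreate the test file from the report; \363 is octal for byte 0xF3.
printf 'a\363bcdefgh' > x2

# In the C locale every byte is a single-byte character, so '.' can
# match 0xF3. -c prints the count of matching lines.
LC_ALL=C grep -c 'a.*h' x2

# Show the matching line itself, piped through od so the 0xF3 byte
# is displayed in escaped form instead of as a raw byte.
LC_ALL=C grep 'a.*h' x2 | od -An -c
```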

[1]: https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html
[2]: https://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html




This bug report was last modified 8 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.