GNU bug report logs - #60697
GNU grep mishandles \b near encoding errors

Package: grep

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Mon, 9 Jan 2023 23:01:01 UTC

Severity: normal

To reply to this bug, email your comments to 60697 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#60697; Package grep. (Mon, 09 Jan 2023 23:01:02 GMT)

Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 09 Jan 2023 23:01:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: GNU grep mishandles \b near encoding errors
Date: Mon, 9 Jan 2023 15:00:15 -0800

Here's a shell session illustrating the problem on Fedora 37, which has 
GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.

  $ export LC_ALL=en_US.utf8
  $ printf '\300\n' | grep '\b'
  grep: (standard input): binary file matches
  $ printf '\300\n' | grep -P '\b'
  $

Plain grep finds a word boundary in the input even though the input 
contains no words (just an encoding error). 'grep -P' does the right thing.
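
To make the expected result concrete, here is a minimal sketch in Python. It is an illustration only: Python's re module is a different engine from both grep's matcher and the glibc regex code, and the helper names is_valid_utf8 and has_word_boundary are hypothetical. It checks two things about the bytes produced by printf '\300\n': that they are not valid UTF-8, and that a word-boundary search over input with no word characters finds nothing, which matches the 'grep -P' result above.

```python
import re

# The two bytes produced by: printf '\300\n'
data = b"\xc0\n"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_word_boundary(raw: bytes) -> bool:
    """Return True if \\b matches anywhere in raw.

    With a bytes pattern, Python treats only ASCII [A-Za-z0-9_] as word
    characters, so byte 0xC0 is not a word character here. This models
    the expected answer; it is not how grep or glibc decide it.
    """
    return re.search(rb"\b", raw) is not None


print(is_valid_utf8(data))        # False: 0xC0 is never a valid UTF-8 byte
print(has_word_boundary(data))    # False: no word characters, so no boundary
print(has_word_boundary(b"a\n"))  # True: boundary on either side of "a"
```

The last call is a control case: an input that does contain a word character yields a boundary, so the False result for the encoding error is not an artifact of the search itself.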

The underlying issue is in the glibc regex code so the fix should be in 
glibc / Gnulib, but I thought I'd report it here before I forgot it.

Information forwarded to bug-grep <at> gnu.org:
bug#60697; Package grep. (Thu, 12 Jan 2023 06:05:01 GMT)

Message #8 received at 60697 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 60697 <at> debbugs.gnu.org
Subject: Re: bug#60697: GNU grep mishandles \b near encoding errors
Date: Wed, 11 Jan 2023 22:03:52 -0800

On Mon, Jan 9, 2023 at 10:16 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Here's a shell session illustrating the problem on Fedora 37, which has
> GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.
>
>    $ export LC_ALL=en_US.utf8
>    $ printf '\300\n' | grep '\b'
>    grep: (standard input): binary file matches
>    $ printf '\300\n' | grep -P '\b'
>    $
>
> Plain grep finds a word boundary in the input even though the input
> contains no words (just an encoding error). 'grep -P' does the right thing.
>
> The underlying issue is in the glibc regex code so the fix should be in
> glibc / Gnulib, but I thought I'd report it here before I forgot it.

Thanks! While this would definitely be nice to fix before the release
(in the next week or so), it's enough of a corner case that I wouldn't
feel bad releasing without a fix.

For the record, this problem first arose in grep-2.19.

This bug report was last modified 2 years and 351 days ago.
