GNU bug report logs - #72246
Possible PCRE bug in grep 3.11

Package: grep

Date: Mon, 22 Jul 2024 18:26:01 UTC

Severity: normal

To reply to this bug, email your comments to 72246 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#72246; Package grep. (Mon, 22 Jul 2024 18:26:01 GMT)

Acknowledgement sent to gdg <at> zplane.com:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 22 Jul 2024 18:26:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Glenn Golden <gdg <at> zplane.com>
To: bug-grep <at> gnu.org
Subject: Possible PCRE bug in grep 3.11
Date: Mon, 22 Jul 2024 12:25:36 -0600

Grep 3.11 doesn't seem to behave as expected with some range-based PCREs.
See attached minimal example, comparing grep 3.11 to pcregrep 8.45.
(The latter behaves as I had thought 'grep -P' ought to, but maybe
I'm wrong on that.)

This may be related to

    https://lists.gnu.org/archive/html/grep-devel/2023-03/msg00017.html

which references a regression in 3.10.

Figured it was worthwhile to report even if it may be a duplicate.

Version info: Arch64 linux, kernel 6.1.68, commodity x86-64 laptop.

- Glenn Golden


========================== BEGIN INLINE ATTACHMENT =========================
#!/usr/bin/bash

#
# String containing 3 octets >= 0x80:
#
str=$(printf "begin\xe2\x80\x99end")

#
# grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
# and exits with 1, indicating no match.
#
printf "Using grep 3.11:\n"
printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]' 
printf "exit value = $?\n";

printf "\n"

#
# pcregrep 8.45 behaves as I thought 'grep -P' ought to:
#
printf "Using pcregrep 8.45:\n"
printf "${str}\n" | pcregrep  --color=auto -e '[\x80-\xFF]' 
printf "exit value = $?\n";
========================== END INLINE ATTACHMENT =========================

Information forwarded to bug-grep <at> gnu.org:
bug#72246; Package grep. (Mon, 22 Jul 2024 19:01:02 GMT)

Message #8 received at 72246 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: gdg <at> zplane.com
Cc: 72246 <at> debbugs.gnu.org
Subject: Re: bug#72246: Possible PCRE bug in grep 3.11
Date: Mon, 22 Jul 2024 12:00:21 -0700

On 2024-07-22 11:25, Glenn Golden wrote:
> str=$(printf "begin\xe2\x80\x99end")
> 
> #
> # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> # and exits with 1, indicating no match.
> #
> printf "Using grep 3.11:\n"
> printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range 
\x80 through \xFF. In a single-byte locale this matches any line 
containing a byte in that range (i.e., any byte with the top bit set), 
and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the 
characters U+0080 (a nameless control character) through U+00FF (LATIN 
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so 
grep doesn't output anything and exits with status 1.
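
The decoding step behind that statement can be checked by hand. The
following is only a sketch using plain shell arithmetic; the variable
names (b1, b2, b3, cp, in_range) are illustrative and do not come from
the report. It unpacks the three-byte UTF-8 sequence E2 80 99 and tests
whether the resulting code point falls inside U+0080..U+00FF:

```shell
#!/bin/sh
# Sketch: decode the three-byte UTF-8 sequence from 'str' by hand.
# Variable names here are illustrative assumptions, not from the report.
b1=0xE2
b2=0x80
b3=0x99

# A three-byte sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx.
# Mask off the marker bits and concatenate the payload bits.
cp=$(( (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F) ))
printf 'code point = U+%04X\n' "$cp"      # prints: code point = U+2019

# In a UTF-8 locale the class [\x80-\xFF] covers U+0080..U+00FF only.
if [ "$cp" -ge $((0x80)) ] && [ "$cp" -le $((0xFF)) ]; then
    in_range=yes
else
    in_range=no
fi
printf 'inside U+0080..U+00FF: %s\n' "$in_range"   # prints: no
```

Since U+2019 lies well above U+00FF, a character-based match against
that range has nothing to find in 'str'.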

In short, to get the behavior you want, set LC_ALL="C" in the environment.
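
As a rough sketch of that suggestion, assuming a GNU grep built with
PCRE support (so that -P is available), the locale can be overridden
for the grep invocation alone. The variable names are illustrative, and
the octal escapes \342\200\231 are the same bytes as \xe2\x80\x99 in
the original script, written in a form every POSIX printf accepts:

```shell
#!/bin/sh
# Sketch: run the same pattern with a single-byte (C) locale.
# Assumes GNU grep with PCRE support; variable names are illustrative.

# Same three high bytes as in the report (E2 80 99), in octal.
line=$(printf 'begin\342\200\231end')

# In the C locale every byte is its own character, so [\x80-\xFF]
# is a plain byte range and the line matches.
printf '%s\n' "$line" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'
c_status=$?
printf 'high-byte line, C locale: exit value = %s\n' "$c_status"

# A pure-ASCII line has no byte >= 0x80, so there is no match.
printf 'plain ascii\n' | LC_ALL=C grep -q -P -e '[\x80-\xFF]'
ascii_status=$?
printf 'ASCII line, C locale: exit value = %s\n' "$ascii_status"
```

Setting LC_ALL on the command line affects only that one grep process,
so the rest of the script keeps whatever locale it was started with.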

If pcregrep finds a match in a UTF-8 locale then that would appear to be 
a bug in pcregrep; you might report it to the pcregrep maintainer.

Information forwarded to bug-grep <at> gnu.org:
bug#72246; Package grep. (Mon, 22 Jul 2024 19:25:02 GMT)

Message #11 received at 72246 <at> debbugs.gnu.org:

From: Glenn Golden <gdg <at> zplane.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 72246 <at> debbugs.gnu.org
Subject: Re: bug#72246: Possible PCRE bug in grep 3.11
Date: Mon, 22 Jul 2024 13:24:26 -0600

Paul Eggert <eggert <at> cs.ucla.edu> [2024-07-22 12:00:21 -0700]:
> On 2024-07-22 11:25, Glenn Golden wrote:
> > str=$(printf "begin\xe2\x80\x99end")
> > 
> > #
> > # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> > # and exits with 1, indicating no match.
> > #
> > printf "Using grep 3.11:\n"
> > printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
> 
> This asks 'grep' to output all lines containing characters in the range \x80
> through \xFF. In a single-byte locale this matches any line containing a
> byte in that range (i.e., any byte with the top bit set), and 'grep' will
> output the line and exit with status zero.
> 
> However, in a UTF-8 locale this will match any line containing the
> characters U+0080 (a nameless control character) through U+00FF (LATIN SMALL
> LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 'str'
> represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so grep
> doesn't output anything and exits with status 1.
> 

Ahhhhhhhhhhh... ok, got it, thanks for the explanation.  I had not realized
that even literal octet-like specifications (e.g. \xNN) get 'promoted' (so
to speak) to the underlying code points when interpreted in UTF-8 locales.

> 
> If pcregrep finds a match in a UTF-8 locale then that would appear to be a
> bug in pcregrep; you might report it to the pcregrep maintainer.
>

In looking just now at the 'pcre' package (which contains pcregrep)
it seems that it is now listed as 'deprecated' in the Arch package list,
so probably not worth reporting.

In any case, thanks for the explanation, and sorry for the noise.

- Glenn
