GNU bug report logs - #60618
unicode characters are not identified as such for \w and \b with -P

Package: grep

Reported by: Carlo Arenas <carenas <at> gmail.com>

Date: Sat, 7 Jan 2023 03:49:01 UTC

Severity: normal

Merged with 60621

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.


Report forwarded to bug-grep <at> gnu.org:
bug#60618; Package grep. (Sat, 07 Jan 2023 03:49:01 GMT)

Acknowledgement sent to Carlo Arenas <carenas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 07 Jan 2023 03:49:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Carlo Arenas <carenas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: unicode characters are not identified as such for \w and \b with -P
Date: Fri, 6 Jan 2023 19:48:01 -0800
Reported to PCRE[1] with mention of GNU grep being also affected.

[1] https://github.com/PCRE2Project/pcre2/issues/185
[0001-pcre-use-UCP-in-UTF-mode.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#60618; Package grep. (Sat, 07 Jan 2023 07:30:03 GMT)

Message #8 received at 60618 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 60618 <at> debbugs.gnu.org
Subject: Re: bug#60618: unicode characters are not identified as such for \w and \b with -P
Date: Fri, 6 Jan 2023 23:28:44 -0800
On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185

Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.

-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1
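The reason for the flip: in these tests, "cmd && fail=1" records a failure when the command succeeds, while "cmd || fail=1" records a failure when it does not. With the fix, grep is expected to find a match (exit status 0), so the second form is the right one. A minimal sketch of the two forms, using a hypothetical stand-in function instead of the real grep invocation:

```shell
# Hypothetical stand-in for "grep -Po 'r\w' in > out" on a fixed grep:
# exit status 0 means a match was found.
matched() { return 0; }

fail=0
# Old form: a successful match is (wrongly) counted as a test failure.
matched && fail=1
old_result=$fail

fail=0
# New form: only a missing match is counted as a test failure.
matched || fail=1
new_result=$fail

echo "old=$old_result new=$new_result"
```

With a matching grep, the old form sets fail=1 and the new form leaves it at 0.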

Also, "make syntax-check" required swapping the argument order of compare, e.g.,

-compare out exp || fail=1
+compare exp out || fail=1

Every bug fix needs a NEWS entry, so I added this:

  With -P, some non-ASCII UTF8 characters were not recognized as
  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
  After the fix, it prints the correct results: "rú:ú".
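
The example in this entry can be wrapped in a small check that reports which behavior the installed grep shows. This is only a sketch: it assumes the en_US.UTF-8 locale is installed and that grep was built with -P support, and the verdict labels are made up for illustration. If the locale is unavailable, the result is not meaningful.

```shell
# Same helper as in the NEWS entry; \303\272 is the UTF-8 encoding of "ú".
# Errors are silenced and the exit status ignored so the check never aborts.
f() { printf 'Per\303\272\n' | LC_ALL=en_US.UTF-8 grep -Po "$1" 2>/dev/null || true; }

result="$(f 'r\w'):$(f '.\b')"

# Classify the output; the labels are illustrative, not grep terminology.
case $result in
  'rú:ú') verdict=fixed ;;     # \w and \b treat "ú" as a word character
  ':r')   verdict=affected ;;  # "ú" is not recognized as word-constituent
  *)      verdict=unknown ;;   # e.g. grep without -P support
esac

echo "$verdict"
```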

Finally, I expanded the ChangeLog entry and gave credit where due.

I'll push this tomorrow:
[grep-pcre-fix.diff (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#60618; Package grep. (Sat, 07 Jan 2023 07:38:04 GMT)

Message #11 received at 60618 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 60618 <at> debbugs.gnu.org
Subject: Re: bug#60618: unicode characters are not identified as such for \w and \b with -P
Date: Fri, 6 Jan 2023 23:37:37 -0800

Must also mention Karl Pettersson in the ChangeLog:

pcre: use UCP in UTF mode

This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.




Merged 60618 60621. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 07 Jan 2023 22:56:02 GMT)

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 08 Jan 2023 02:30:02 GMT)

Notification sent to Carlo Arenas <carenas <at> gmail.com>:
bug acknowledged by developer. (Sun, 08 Jan 2023 02:30:02 GMT)

Message #18 received at 60618-done <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 60618-done <at> debbugs.gnu.org
Subject: Re: bug#60618: unicode characters are not identified as such for \w and \b with -P
Date: Sat, 7 Jan 2023 18:28:49 -0800

I've also added the new names to THANKS.in and pushed this:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3




Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 08 Jan 2023 02:30:03 GMT)

Notification sent to Karl Pettersson <karl.pettersson <at> klpn.se>:
bug acknowledged by developer. (Sun, 08 Jan 2023 02:30:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 05 Feb 2023 12:24:05 GMT)
