GNU bug report logs - #60618
unicode characters are not identified as such for \w and \b with -P
Reported by: Carlo Arenas <carenas <at> gmail.com>
Date: Sat, 7 Jan 2023 03:49:01 UTC
Severity: normal
Merged with 60621
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 60618 in the body.
You can then email your comments to 60618 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#60618; Package grep. (Sat, 07 Jan 2023 03:49:01 GMT)
Acknowledgement sent to Carlo Arenas <carenas <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 07 Jan 2023 03:49:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Reported to PCRE[1] with mention of GNU grep being also affected.
[1] https://github.com/PCRE2Project/pcre2/issues/185
[0001-pcre-use-UCP-in-UTF-mode.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#60618; Package grep. (Sat, 07 Jan 2023 07:30:03 GMT)
Message #8 received at 60618 <at> debbugs.gnu.org:
On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185
Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.
-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1
Also, "make syntax-check" required changing, e.g.,
-compare out exp || fail=1
+compare exp out || fail=1
Every bug fix needs a NEWS entry, so I added this:
With -P, some non-ASCII UTF8 characters were not recognized as
word-constituent due to our omission of the PCRE_UCP flag. E.g.,
given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
After the fix, it prints the correct results: "rú:ú".
Finally, I expanded the ChangeLog entry and gave credit where due.
I'll push this tomorrow:
[grep-pcre-fix.diff (application/octet-stream, attachment)]
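The before/after outputs quoted in the NEWS entry above can be reproduced in spirit with a short Python sketch. This is an analogy only: it uses Python's `re` module rather than PCRE2 or grep, and it assumes that `re.ASCII` is a fair stand-in for matching without the UCP flag (word characters limited to ASCII), while the default mode stands in for matching with UCP set. The helper names `matches` and `report` are hypothetical.

```python
import re

# "Perú" written with an escape so the precomposed U+00FA is unambiguous.
TEXT = "Per\u00fa"


def matches(pattern, ascii_only):
    """Return all non-overlapping matches, roughly what `grep -o` prints."""
    # Assumption: re.ASCII approximates PCRE2 without UCP,
    # the default Unicode mode approximates PCRE2 with UCP.
    flags = re.ASCII if ascii_only else 0
    return re.findall(pattern, TEXT, flags)


def report(ascii_only):
    """Mimic the NEWS command: echo $(f 'r\\w'):$(f '.\\b')."""
    left = " ".join(matches(r"r\w", ascii_only))
    right = " ".join(matches(r".\b", ascii_only))
    return f"{left}:{right}"


if __name__ == "__main__":
    print(report(ascii_only=True))   # ASCII-only notion of a word character
    print(report(ascii_only=False))  # Unicode-aware notion of a word character
```

In ASCII-only mode `ú` is not a word character, so `r\w` finds nothing and the only boundary that `.\b` can reach is the one right after `r`. In Unicode-aware mode `ú` counts as a word character, so `r\w` matches `rú` and the boundary moves to the end of the word.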
Information forwarded to bug-grep <at> gnu.org: bug#60618; Package grep. (Sat, 07 Jan 2023 07:38:04 GMT)
Message #11 received at 60618 <at> debbugs.gnu.org:
On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering <jim <at> meyering.net> wrote:
> [...]
> I'll push this tomorrow:
Must also mention Karl Pettersson in the ChangeLog:
pcre: use UCP in UTF mode
This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF.
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.
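The ChangeLog line "Set PCRE2_UCP together with PCRE2_UTF" describes a simple invariant: whenever UTF mode is requested, Unicode-property mode is requested with it. The Python sketch below illustrates that invariant only. It is not grep's C code from src/pcresearch.c; the `CompileFlag` enum, its numeric values, and the `compile_flags` helper are hypothetical stand-ins chosen for illustration.

```python
from enum import IntFlag


class CompileFlag(IntFlag):
    """Hypothetical stand-ins for compile options; values are illustrative."""
    UTF = 1       # treat pattern and subject as UTF-8
    UCP = 2       # use Unicode properties for \w, \b and similar
    CASELESS = 4  # unrelated option, present to show flags combine freely


def compile_flags(utf8_locale, ignore_case=False):
    """Build the option set, pairing UCP with UTF as the ChangeLog describes."""
    flags = CompileFlag(0)
    if ignore_case:
        flags |= CompileFlag.CASELESS
    if utf8_locale:
        # The invariant from the fix: UTF is never set without UCP.
        flags |= CompileFlag.UTF | CompileFlag.UCP
    return flags


if __name__ == "__main__":
    print(compile_flags(utf8_locale=True))
    print(compile_flags(utf8_locale=False, ignore_case=True))
```

Setting both options in one place makes it hard to enable UTF matching without also getting Unicode-aware word characters.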
Merged 60618 60621. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 07 Jan 2023 22:56:02 GMT)
Reply sent to Jim Meyering <jim <at> meyering.net>: You have taken responsibility. (Sun, 08 Jan 2023 02:30:02 GMT)
Notification sent to Carlo Arenas <carenas <at> gmail.com>: bug acknowledged by developer. (Sun, 08 Jan 2023 02:30:02 GMT)
Message #18 received at 60618-done <at> debbugs.gnu.org:
On Fri, Jan 6, 2023 at 11:37 PM Jim Meyering <jim <at> meyering.net> wrote:
> [...]
I've also added the new names to THANKS.in and pushed this:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3
Reply sent to Jim Meyering <jim <at> meyering.net>: You have taken responsibility. (Sun, 08 Jan 2023 02:30:03 GMT)
Notification sent to Karl Pettersson <karl.pettersson <at> klpn.se>: bug acknowledged by developer. (Sun, 08 Jan 2023 02:30:03 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 05 Feb 2023 12:24:05 GMT)