GNU bug report logs - #62983
workaround PCRE2 bug affecting at least \D and \W

Package: grep; Reported by: Carlo Marcelo Arenas Belón <carenas@HIDDEN>; dated Fri, 21 Apr 2023 02:05:01 UTC; Maintainer for grep is bug-grep@HIDDEN.

Message received at 62983 <at> debbugs.gnu.org:


Received: (at 62983) by debbugs.gnu.org; 29 Apr 2023 06:55:06 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Apr 29 02:55:06 2023
Received: from localhost ([127.0.0.1]:35057 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1pseU6-0001Y9-4K
	for submit <at> debbugs.gnu.org; Sat, 29 Apr 2023 02:55:06 -0400
Received: from mail-lj1-f174.google.com ([209.85.208.174]:45257)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@HIDDEN>) id 1pseU3-0001XZ-G7
 for 62983 <at> debbugs.gnu.org; Sat, 29 Apr 2023 02:55:04 -0400
Received: by mail-lj1-f174.google.com with SMTP id
 38308e7fff4ca-2a8b082d6feso5327931fa.2
 for <62983 <at> debbugs.gnu.org>; Fri, 28 Apr 2023 23:55:03 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20221208; t=1682751297; x=1685343297;
 h=cc:to:subject:message-id:date:from:in-reply-to:references
 :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
 :reply-to;
 bh=iNbhEgysq8IS3aW395mPDVpd3MDiYb42t3blmdrCy4E=;
 b=Cjob8Dj91Q1m15td3L4N1z7XfiS2+LZykW9okCK6j3U46uSWlqjRSjleenoMpTQSt7
 fmVDlPHGN9c3EfCb3tTBf7APZ2pfB8zjORmqARXniC+1w+uRnbi9hfsFZokKQnsC22VI
 Ql2ouYooIULfo6tv3X3jWjB5wMpzlaKG7n2mdD0HAHK+7ie9Pa7BiqN/qZh22qG03v8u
 2MqhTrto2jLOREtx6DSzk+pXMEHgw340zrFOZQk/K8EBzCk3IvSVWd8wfJcDBZAhRfAa
 N7+jKrGHUruGIm9bEh5eUFR51NlPYIvFgW3pYsWGl0mgR4aBo2cetw+74qaBX6FTL9oy
 BJ9A==
X-Gm-Message-State: AC+VfDzBEGBzZuFZDLIUvNzB8DcHXR7ZZqk6G7h4DWZmUY+kZBJg6wVh
 zeBepnhwxq98MBB49+7F3FUGYHY1J8uUVF81kzI=
X-Google-Smtp-Source: ACHHUZ5HAqEoOQVSpANvW5Tn8kEwkV1MNGz00xlIOrsZBj+7eDH3ZWv/ZwQEYgE1Z6VC4aKh/SqDYiVKnDNauh0zZUs=
X-Received: by 2002:a2e:8501:0:b0:2a9:f8fd:49ff with SMTP id
 j1-20020a2e8501000000b002a9f8fd49ffmr2175720lji.17.1682751297436; Fri, 28 Apr
 2023 23:54:57 -0700 (PDT)
MIME-Version: 1.0
References: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
 <c82d3567-5dc9-ec84-f656-90e480bd3987@HIDDEN>
 <zwfll3hke4opx3ueoap3xodaxqf4vqjiy5zsknj4ngouohx63v@nd4npghhit3n>
In-Reply-To: <zwfll3hke4opx3ueoap3xodaxqf4vqjiy5zsknj4ngouohx63v@nd4npghhit3n>
From: Jim Meyering <jim@HIDDEN>
Date: Sat, 29 Apr 2023 08:54:44 +0200
Message-ID: <CA+8g5KEvbw1cdJW+wn8fKf8izcE6oVQ=G2XaCoANzNR6s48=Xg@HIDDEN>
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
To: =?UTF-8?Q?Carlo_Marcelo_Arenas_Bel=C3=B3n?= <carenas@HIDDEN>
Content-Type: multipart/mixed; boundary="000000000000545b5d05fa74110f"
X-Spam-Score: 0.2 (/)
X-Debbugs-Envelope-To: 62983
Cc: Paul Eggert <eggert@HIDDEN>, 62983 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.8 (/)

--000000000000545b5d05fa74110f
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Fri, Apr 21, 2023 at 10:22 PM Carlo Marcelo Arenas Belón
<carenas@HIDDEN> wrote:
> On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> > On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > > its JIT implementation that results in failure to match for the negative
> > > perl classes, and seems to be easier to replicate when the matching
> > > character is a multibyte one.
> >
> > Unfortunately that is a little vague. I expect the issue is not limited to
> > \D and \W, as there are other ways to specify negative Perl classes.
>
> Correct, it should also affect at least \S, but I haven't been able to
> trigger it there.
>
> The bug was that an uninitialized value was being used in the JIT code that
> supports the PCRE2_MATCH_INVALID_UTF mode, which is why I said "randomly" in
> the commit message.
>
> If you want to be strict, how about the attached patch instead?
>
> > And if
> > the bug merely seems to be easier to replicate with multibyte characters, it
> > sounds like we may have issues even when matching ASCII characters in a
> > UTF-8 locale.
>
> Which the current workaround addresses, since you need both PCRE2_JIT and
> PCRE2_MATCH_INVALID_UTF to trigger it, and the subject encoding is irrelevant
> for the logic to decide if PCRE2_MATCH_INVALID_UTF gets enabled or not.
>
> > Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should
> > focus our optimization efforts on future PCRE2 versions, and not worry about
> > optimizing earlier versions where optimizations complicate maintenance for a
> > declining benefit, and are likely to provoke bugs in older versions that as
> > time passes will be harder to debug.
>
> Not sure I understand your concern here, but if it is about disabling JIT
> instead, then the possibility of introducing bugs is even bigger since it
> affects all versions of PCRE2 (not only 10.34 or newer).
>
> > > Alternatively JIT could be disabled instead, but the option selected has
> > > less of an impact on performance.
> >
> > Disabling JIT sounds better, as correctness trumps performance. Until the
> > bug is fixed (or at least better-understood so that we have a workaround we
> > can trust), how about the attached patch instead?
>
> The bug has already been fixed, and the fix will be included in the next
> release. There might be additional changes, as spelled out in that
> discussion, and indeed the change to the proposed solution proactively
> helps with one of those.
>
> It is very unlikely, but some systems might include nonzero values in the
> tables for characters over 127, and that might trigger a similar problem that
> is yet to be fixed.
>
> Carlo
>
> [1] https://github.com/PCRE2Project/pcre2/commit/2c08b619dc973beacc474dcb67cda8cd366200ce

Thanks, Carlo.
I've made some small adjustments and tidied up the ChangeLog in the attached.
Hope to push it by Sunday.

There's enough going on via gnulib that I'll likely make yet another
snapshot with the very latest.

Also, there remain Solaris SPARC and i386 gnulib test failures:

    https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-sparc/builds/336
      FAIL: test-c-stack.sh
      FAIL: test-year2038

    https://buildfarm.opencsw.org/buildbot/builders/ggrep-solaris10-i386/builds/334
      FAIL: test-year2038

--000000000000545b5d05fa74110f
Content-Type: application/octet-stream; name="grep-pcre2.diff"
Content-Disposition: attachment; filename="grep-pcre2.diff"
Content-Transfer-Encoding: base64
Content-ID: <f_lh1mnrd10>
X-Attachment-Id: f_lh1mnrd10

RnJvbSA5Mzk3Yzc0ZmNlODhlZWYxN2RkMDBhN2M3Yjg4OWQwNDk1ZjQ1YjUxIE1vbiBTZXAgMTcg
MDA6MDA6MDAgMjAwMQpGcm9tOiA9P1VURi04P3E/Q2FybG89MjBNYXJjZWxvPTIwQXJlbmFzPTIw
QmVsPUMzPUIzbj89IDxjYXJlbmFzQGdtYWlsLmNvbT4KRGF0ZTogVGh1LCAyMCBBcHIgMjAyMyAx
ODozNzoyMCAtMDcwMApTdWJqZWN0OiBbUEFUQ0hdIHBjcmU6IHdvcmsgYXJvdW5kIGEgUENSRTJf
TUFUQ0hfSU5WQUxJRF9VVEYgYnVnCgpQQ1JFMiBoYXMgYSBidWcgd2hlbiB1c2luZyBQQ1JFMl9N
QVRDSF9JTlZBTElEX1VURjogaXQgd291bGQKc29tZXRpbWVzIGZhaWwgdG8gbWF0Y2ggcGF0dGVy
bnMgdXNpbmcgcGVybCBuZWdhdGl2ZSBjbGFzc2VzCmxpa2UgXFcgYW5kIFxELgoKKiBORVdTIChC
dWcgZml4ZXMpOiBNZW50aW9uIGl0LgoqIHNyYy9wY3JlMnNlYXJjaC5jOiByZXN0cmljIGltcGFj
dCBvZiB0aGUgYnVnCkRvIG5vdCB1c2UgdGhlIHByb2JsZW1hdGljIGZsYWcgd2l0aCBicm9rZW4g
dmVyc2lvbnMgb2YgUENSRTIuCkdlbmVyYXRlIGxvY2FsZSB0YWJsZXMgb25seSBmb3Igc2luZ2xl
LWJ5dGUgbG9jYWxlcy4KKiB0ZXN0cy9NYWtlZmlsZS5hbSAoVEVTVFMpOiBBZGQgdGhlIGZpbGUg
bmFtZQoqIHRlc3RzL3BjcmUtdXRmOC1idWcyMjQ6IE5ldyBmaWxlLCB0byB0ZXN0IGZvciB0aGlz
LgotLS0KIE5FV1MgICAgICAgICAgICAgICAgICAgfCAgNSArKysrKwogc3JjL3BjcmVzZWFyY2gu
YyAgICAgICB8IDIyICsrKysrKysrKysrKysrLS0tLS0tLS0KIHRlc3RzL01ha2VmaWxlLmFtICAg
ICAgfCAgMSArCiB0ZXN0cy9wY3JlLXV0ZjgtYnVnMjI0IHwgMzEgKysrKysrKysrKysrKysrKysr
KysrKysrKysrKysrKwogNCBmaWxlcyBjaGFuZ2VkLCA1MSBpbnNlcnRpb25zKCspLCA4IGRlbGV0
aW9ucygtKQogY3JlYXRlIG1vZGUgMTAwNzU1IHRlc3RzL3BjcmUtdXRmOC1idWcyMjQKCmRpZmYg
LS1naXQgYS9ORVdTIGIvTkVXUwppbmRleCBjMTU3NjRjLi45N2E5MTNjIDEwMDY0NAotLS0gYS9O
RVdTCisrKyBiL05FV1MKQEAgLTE1LDYgKzE1LDExIEBAIEdOVSBncmVwIE5FV1MgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAtKi0gb3V0bGluZSAtKi0KICAgd2hlbiBydW5uaW5n
IG9uIDMyLWJpdCB4ODYgYW5kIEFSTSBob3N0cyB1c2luZyBnbGliYyAyLjM0Ky4KICAgW2J1ZyBp
bnRyb2R1Y2VkIGluIGdyZXAgMy45XQoKKyAgZ3JlcCBubyBsb25nZXIgZmFpbHMgdG8gbWF0Y2gg
cGF0dGVybnMgdXNpbmcgbmVnYXRlZCBwZXJsCisgIGNsYXNzZXMgbGlrZSBcRCBvciBcVyB3aGVu
IGxpbmtlZCB3aXRoIFBDUkUyIDEwLjM0IG9yIG5ld2VyLgorICBbYnVnIGludHJvZHVjZWQgaW4g
Z3JlcCAzLjhdCisKKwogKiogQ2hhbmdlcyBpbiBiZWhhdmlvcgoKICAgZ3JlcCAtLXZlcnNpb24g
bm93IHByaW50cyBhIGxpbmUgZGVzY3JpYmluZyB0aGUgdmVyc2lvbiBvZiBQQ1JFMiBpdCB1c2Vz
LgpkaWZmIC0tZ2l0IGEvc3JjL3BjcmVzZWFyY2guYyBiL3NyYy9wY3Jlc2VhcmNoLmMKaW5kZXgg
ZTg2N2Y0OS4uNjhlYzZkZSAxMDA2NDQKLS0tIGEvc3JjL3BjcmVzZWFyY2guYworKysgYi9zcmMv
cGNyZXNlYXJjaC5jCkBAIC01OCw2ICs1OCw5IEBAIHN0cnVjdCBwY3JlX2NvbXAKICAgLyogVGFi
bGUsIGluZGV4ZWQgYnkgISAoZmxhZyAmIFBDUkUyX05PVEJPTCksIG9mIHdoZXRoZXIgdGhlIGVt
cHR5CiAgICAgIHN0cmluZyBtYXRjaGVzIHdoZW4gdGhhdCBmbGFnIGlzIHVzZWQuICAqLwogICBp
bnQgZW1wdHlfbWF0Y2hbMl07CisKKyAgLyogRmxhZ3MgKi8KKyAgdW5zaWduZWQgYmluYXJ5X3Nh
ZmU6MTsKIH07CgogLyogTWVtb3J5IGFsbG9jYXRpb24gZnVuY3Rpb25zIGZvciBQQ1JFLiAgKi8K
QEAgLTEzMCwxNiArMTMzLDExIEBAIGppdF9leGVjIChzdHJ1Y3QgcGNyZV9jb21wICpwYywgY2hh
ciBjb25zdCAqc3ViamVjdCwgaWR4X3Qgc2VhcmNoX2J5dGVzLAogICAgIH0KIH0KCi0vKiBSZXR1
cm4gdHJ1ZSBpZiBFIGlzIGFuIGVycm9yIGNvZGUgZm9yIGJhZCBVVEYtOCwgYW5kIGlmIHBjcmUy
X21hdGNoCi0gICBjb3VsZCByZXR1cm4gRSBiZWNhdXNlIFBDUkUgbGFja3MgUENSRTJfTUFUQ0hf
SU5WQUxJRF9VVEYuICAqLworLyogUmV0dXJuIHRydWUgaWYgRSBpcyBhbiBlcnJvciBjb2RlIGZv
ciBiYWQgVVRGLTggKi8KIHN0YXRpYyBib29sCiBiYWRfdXRmOF9mcm9tX3BjcmUyIChpbnQgZSkK
IHsKLSNpZmRlZiBQQ1JFMl9NQVRDSF9JTlZBTElEX1VURgotICByZXR1cm4gZmFsc2U7Ci0jZWxz
ZQogICByZXR1cm4gUENSRTJfRVJST1JfVVRGOF9FUlIyMSA8PSBlICYmIGUgPD0gUENSRTJfRVJS
T1JfVVRGOF9FUlIxOwotI2VuZGlmCiB9CgogLyogQ29tcGlsZSB0aGUgLVAgc3R5bGUgUEFUVEVS
TiwgY29udGFpbmluZyBTSVpFIGJ5dGVzIHRoYXQgYXJlCkBAIC0xNTcsNiArMTU1LDcgQEAgUGNv
bXBpbGUgKGNoYXIgKnBhdHRlcm4sIGlkeF90IHNpemUsIHJlZ19zeW50YXhfdCBpZ25vcmVkLCBi
b29sIGV4YWN0KQogICAgID0gcGNyZTJfZ2VuZXJhbF9jb250ZXh0X2NyZWF0ZSAocHJpdmF0ZV9t
YWxsb2MsIHByaXZhdGVfZnJlZSwgTlVMTCk7CiAgIHBjcmUyX2NvbXBpbGVfY29udGV4dCAqY2Nv
bnRleHQgPSBwY3JlMl9jb21waWxlX2NvbnRleHRfY3JlYXRlIChnY29udGV4dCk7CgorICBwYy0+
YmluYXJ5X3NhZmUgPSBmYWxzZTsKICAgaWYgKGxvY2FsZWluZm8ubXVsdGlieXRlKQogICAgIHsK
ICAgICAgIHVpbnQzMl90IHVuaWNvZGU7CkBAIC0xODEsOCArMTgwLDEzIEBAIFBjb21waWxlIChj
aGFyICpwYXR0ZXJuLCBpZHhfdCBzaXplLCByZWdfc3ludGF4X3QgaWdub3JlZCwgYm9vbCBleGFj
dCkKICAgICAgIGZsYWdzIHw9IFBDUkUyX05FVkVSX0JBQ0tTTEFTSF9DOwogI2VuZGlmCiAjaWZk
ZWYgUENSRTJfTUFUQ0hfSU5WQUxJRF9VVEYKKyAgICAgIC8qIHdvcmthcm91bmQgUENSRTIgYnVn
CisgICAgICAgICBodHRwczovL2dpdGh1Yi5jb20vUENSRTJQcm9qZWN0L3BjcmUyL2lzc3Vlcy8y
MjQgKi8KKyNpZiAxMCA8IFBDUkUyX01BSk9SIHx8IChQQ1JFMl9NQUpPUiA9PSAxMCAmJiA0MiA8
IFBDUkUyX01JTk9SKQorICAgICAgcGMtPmJpbmFyeV9zYWZlID0gdHJ1ZTsKICAgICAgIC8qIENv
bnNpZGVyIGludmFsaWQgVVRGLTggYXMgYSBiYXJyaWVyLCBpbnN0ZWFkIG9mIGVycm9yLiAgKi8K
ICAgICAgIGZsYWdzIHw9IFBDUkUyX01BVENIX0lOVkFMSURfVVRGOworI2VuZGlmCiAjZW5kaWYK
ICAgICB9CgpAQCAtMjI2LDcgKzIzMCw5IEBAIFBjb21waWxlIChjaGFyICpwYXR0ZXJuLCBpZHhf
dCBzaXplLCByZWdfc3ludGF4X3QgaWdub3JlZCwgYm9vbCBleGFjdCkKICAgICAgIHNpemUgPSBy
ZV9zaXplOwogICAgIH0KCi0gIHBjcmUyX3NldF9jaGFyYWN0ZXJfdGFibGVzIChjY29udGV4dCwg
cGNyZTJfbWFrZXRhYmxlcyAoZ2NvbnRleHQpKTsKKyAgaWYgKCFsb2NhbGVpbmZvLm11bHRpYnl0
ZSkKKyAgICBwY3JlMl9zZXRfY2hhcmFjdGVyX3RhYmxlcyAoY2NvbnRleHQsIHBjcmUyX21ha2V0
YWJsZXMgKGdjb250ZXh0KSk7CisKICAgcGMtPmNyZSA9IHBjcmUyX2NvbXBpbGUgKChQQ1JFMl9T
UFRSKSBwYXR0ZXJuLCBzaXplLCBmbGFncywKICAgICAgICAgICAgICAgICAgICAgICAgICAgICZl
YywgJmUsIGNjb250ZXh0KTsKICAgaWYgKCFwYy0+Y3JlKQpAQCAtMzEzLDcgKzMxOSw3IEBAIFBl
eGVjdXRlICh2b2lkICp2Y3AsIGNoYXIgY29uc3QgKmJ1ZiwgaWR4X3Qgc2l6ZSwgaWR4X3QgKm1h
dGNoX3NpemUsCgogICAgICAgICAgIGUgPSBqaXRfZXhlYyAocGMsIHN1YmplY3QsIGxpbmVfZW5k
IC0gc3ViamVjdCwKICAgICAgICAgICAgICAgICAgICAgICAgIHNlYXJjaF9vZmZzZXQsIG9wdGlv
bnMpOwotICAgICAgICAgIGlmICghYmFkX3V0ZjhfZnJvbV9wY3JlMiAoZSkpCisgICAgICAgICAg
aWYgKHBjLT5iaW5hcnlfc2FmZSB8fCAhYmFkX3V0ZjhfZnJvbV9wY3JlMiAoZSkpCiAgICAgICAg
ICAgICBicmVhazsKCiAgICAgICAgICAgaWR4X3QgdmFsaWRfYnl0ZXMgPSBwY3JlMl9nZXRfc3Rh
cnRjaGFyIChwYy0+ZGF0YSk7CmRpZmYgLS1naXQgYS90ZXN0cy9NYWtlZmlsZS5hbSBiL3Rlc3Rz
L01ha2VmaWxlLmFtCmluZGV4IDc3MThmMjQuLjliNDQyMmUgMTAwNjQ0Ci0tLSBhL3Rlc3RzL01h
a2VmaWxlLmFtCisrKyBiL3Rlc3RzL01ha2VmaWxlLmFtCkBAIC0xNTUsNiArMTU1LDcgQEAgVEVT
VFMgPQkJCQkJCVwKICAgcGNyZS1qaXRzdGFjawkJCQkJXAogICBwY3JlLW8JCQkJCVwKICAgcGNy
ZS11dGY4CQkJCQlcCisgIHBjcmUtdXRmOC1idWcyMjQJCQkJXAogICBwY3JlLXV0ZjgtdwkJCQkJ
XAogICBwY3JlLXcJCQkJCVwKICAgcGNyZS13eC1iYWNrcmVmCQkJCVwKZGlmZiAtLWdpdCBhL3Rl
c3RzL3BjcmUtdXRmOC1idWcyMjQgYi90ZXN0cy9wY3JlLXV0ZjgtYnVnMjI0Cm5ldyBmaWxlIG1v
ZGUgMTAwNzU1CmluZGV4IDAwMDAwMDAuLmU3ZTBkY2QKLS0tIC9kZXYvbnVsbAorKysgYi90ZXN0
cy9wY3JlLXV0ZjgtYnVnMjI0CkBAIC0wLDAgKzEsMzEgQEAKKyMhL2Jpbi9zaAorIyBFbnN1cmUg
bmVnYXRlZCBwZXJsIGNsYXNzZXMgbWF0Y2ggbXVsdGlieXRlIGNoYXJhY3RlcnMgaW4gVVRGIG1v
ZGUKKyMKKyMgQ29weXJpZ2h0IChDKSAyMDIzIEZyZWUgU29mdHdhcmUgRm91bmRhdGlvbiwgSW5j
LgorIworIyBDb3B5aW5nIGFuZCBkaXN0cmlidXRpb24gb2YgdGhpcyBmaWxlLCB3aXRoIG9yIHdp
dGhvdXQgbW9kaWZpY2F0aW9uLAorIyBhcmUgcGVybWl0dGVkIGluIGFueSBtZWRpdW0gd2l0aG91
dCByb3lhbHR5IHByb3ZpZGVkIHRoZSBjb3B5cmlnaHQKKyMgbm90aWNlIGFuZCB0aGlzIG5vdGlj
ZSBhcmUgcHJlc2VydmVkLgorCisuICIke3NyY2Rpcj0ufS9pbml0LnNoIjsgcGF0aF9wcmVwZW5k
XyAuLi9zcmMKK3JlcXVpcmVfZW5fdXRmOF9sb2NhbGVfCitMQ19BTEw9ZW5fVVMuVVRGLTgKK2V4
cG9ydCBMQ19BTEwKK3JlcXVpcmVfcGNyZV8KKworZWNobyAuIHwgZ3JlcCAtcVAgJygqVVRGKS4n
IDI+L2Rldi9udWxsIFwKKyAgfHwgc2tpcF8gJ1BDUkUgdW5pY29kZSBzdXBwb3J0IGlzIGNvbXBp
bGVkIG91dCcKKworZmFpbD0wCisKKyMgJ8OxJyAoVSswMEYxKQorcHJpbnRmICdcMzAyXDIyMVxu
JyA+IGluIHx8IGZyYW1ld29ya19mYWlsdXJlXworZ3JlcCAtUCAnXEQnIGluID4gb3V0IHx8IGZh
aWw9MQorY29tcGFyZSBpbiBvdXQgfHwgZmFpbD0xCisKKyMg4oCc8J2EnuKAnSAoVSsxRDExRSkK
K3ByaW50ZiAnXDM2MFwyMzVcMjA0XDIzNlxuJyA+IGluIHx8IGZyYW1ld29ya19mYWlsdXJlXwor
Z3JlcCAtUCAnXFcnIGluID4gb3V0IHx8IGZhaWw9MQorY29tcGFyZSBpbiBvdXQgfHwgZmFpbD0x
CisKK0V4aXQgJGZhaWwKLS0gCjIuNDAuMC4zNjMuZzljNjk5MGNjYTIKCg==
--000000000000545b5d05fa74110f--




Information forwarded to bug-grep@HIDDEN:
bug#62983; Package grep. Full text available.

Message received at 62983 <at> debbugs.gnu.org:


Received: (at 62983) by debbugs.gnu.org; 21 Apr 2023 20:21:07 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Apr 21 16:21:07 2023
Received: from localhost ([127.0.0.1]:41241 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ppxFi-0003xC-Tc
	for submit <at> debbugs.gnu.org; Fri, 21 Apr 2023 16:21:07 -0400
Received: from mail-pf1-f170.google.com ([209.85.210.170]:62527)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <carenas@HIDDEN>) id 1ppxFd-0003wG-SC
 for 62983 <at> debbugs.gnu.org; Fri, 21 Apr 2023 16:21:05 -0400
Received: by mail-pf1-f170.google.com with SMTP id
 d2e1a72fcca58-63d4595d60fso16685142b3a.0
 for <62983 <at> debbugs.gnu.org>; Fri, 21 Apr 2023 13:21:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20221208; t=1682108456; x=1684700456;
 h=in-reply-to:content-transfer-encoding:content-disposition
 :mime-version:references:message-id:subject:cc:to:from:date:from:to
 :cc:subject:date:message-id:reply-to;
 bh=Wk43UOhQN17QhkherOf7+1kJXZ/T0Azv/V5Gck1V8yI=;
 b=EX+/I0JW8OfQXTettP3EdwxwBBZGH4eJBYyHK2Mka8m/FXxhRaMGn1nMf/ADz8110k
 MbsWwzQSgIAdGebuON3yOZJB/hI9S6h2wf2GMc88X3h6ZQyJ3qGEJYTsJ5T4LCbLw/eT
 65jhxYmMq6IWbkU9sq/Ohjqswxru8tpI6ngaSyI8D1Pyy4cBSUxL+ypA1hpq4yfQ60Pp
 Z7jnM1Vl6+Oxlb9l6ExHQ7EVgSGcRXwtUhYC1hI3Bt1CAOQKFkF14sI/gQE8rtTRJYZz
 U9j1rESHMreZM2tsCatvnglm77wBXwB/kDjioSd/XEir7gW4T15DwMXvyJm9EaEhcb+e
 tljQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20221208; t=1682108456; x=1684700456;
 h=in-reply-to:content-transfer-encoding:content-disposition
 :mime-version:references:message-id:subject:cc:to:from:date
 :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
 bh=Wk43UOhQN17QhkherOf7+1kJXZ/T0Azv/V5Gck1V8yI=;
 b=lJX099+vjuw27hGIeTa5wLva1HgqFl+F5AQZVbiS0YJNz51OMNM1dDG2sAbXUxycWm
 M1gv+fSfFhkGTm2bqEXO2VQZLIKs4e5cCrA+kbsvbkYEksqG5o3HjXog0ksTuN8iGlmh
 A9MwwOMTGTjBYyTWEszn/tMg7qgfsrPpo1TYaI7aw47I1POFaQSpwk7T+i1srZWMcdfu
 kddFU5Gg1M7OJBCdYO5jOpmVFmylgfgV6OA46e+d4NRQ3ORMKXpj5+0srGhjewEN60Kj
 pRgmrnzroJBlU3J+BN+SD5J6KxTOHbTqKG9gKPU0zyFgpLvyZOVS6gceKSBDy7Re1ZBB
 0uNg==
X-Gm-Message-State: AAQBX9ePmdOTUN0tmfhuo9TlrJD56ENpL0Rdh5MpunighgQpupt3Unva
 tCABQxnVvZLHQQNNAdOsBww=
X-Google-Smtp-Source: AKy350bUDgVMuqqTOZBY4eZJ03dMT5eXUck6H+eLbaEjZtz5lXt2miiI6VYc+AlXx63JmuUNJFVyUw==
X-Received: by 2002:a17:902:ecd0:b0:1a6:8548:e0ac with SMTP id
 a16-20020a170902ecd000b001a68548e0acmr6336164plh.34.1682108455616; 
 Fri, 21 Apr 2023 13:20:55 -0700 (PDT)
Received: from Carlos-MacBook-Pro-2.local
 (192-184-219-167.fiber.dynamic.sonic.net. [192.184.219.167])
 by smtp.gmail.com with ESMTPSA id
 h11-20020a170902748b00b001a641e4738asm3090742pll.1.2023.04.21.13.20.54
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 21 Apr 2023 13:20:55 -0700 (PDT)
Date: Fri, 21 Apr 2023 13:20:53 -0700
From: Carlo Marcelo Arenas =?utf-8?B?QmVsw7Nu?= <carenas@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
Message-ID: <zwfll3hke4opx3ueoap3xodaxqf4vqjiy5zsknj4ngouohx63v@nd4npghhit3n>
References: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
 <c82d3567-5dc9-ec84-f656-90e480bd3987@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="kp62zerfaxgdsxut"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <c82d3567-5dc9-ec84-f656-90e480bd3987@HIDDEN>
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 62983
Cc: 62983 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)


--kp62zerfaxgdsxut
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

On Fri, Apr 21, 2023 at 11:42:50AM -0700, Paul Eggert wrote:
> On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> > its JIT implementation that results in failure to match for the negative
> > perl classes, and seems to be easier to replicate when the matching
> > character is a multibyte one.
> 
> Unfortunately that is a little vague. I expect the issue is not limited to
> \D and \W, as there are other ways to specify negative Perl classes.

Correct, it should also affect at least \S, but I haven't been able to
trigger it there.

The bug was that an uninitialized value was being used in the JIT code that
supports the PCRE2_MATCH_INVALID_UTF mode, which is why I said "randomly" in
the commit message.

If you want to be strict, how about the attached patch instead?

> And if
> the bug merely seems to be easier to replicate with multibyte characters, it
> sounds like we may have issues even when matching ASCII characters in a
> UTF-8 locale.

Which the current workaround addresses, since you need both PCRE2_JIT and
PCRE2_MATCH_INVALID_UTF to trigger it, and the subject encoding is irrelevant
for the logic to decide if PCRE2_MATCH_INVALID_UTF gets enabled or not.
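
As a minimal sketch of that condition (illustrative only, not code from grep
or from the attached patch; the struct and function names are hypothetical,
and it assumes the flag is enabled whenever the locale is multibyte and the
library provides it), note that the subject text is not an input at all:

```c
#include <stdbool.h>

/* Hypothetical summary of how a grep -P pattern gets compiled.  */
struct compile_setup
{
  bool jit_enabled;             /* the pattern will be JIT-compiled */
  bool multibyte_locale;        /* e.g. a UTF-8 locale */
  bool have_match_invalid_utf;  /* library defines PCRE2_MATCH_INVALID_UTF */
};

/* Assumption: the flag is chosen from the locale and the library alone,
   never from the bytes of the subject being searched.  */
static bool
uses_match_invalid_utf (struct compile_setup const *s)
{
  return s->multibyte_locale && s->have_match_invalid_utf;
}

/* The bug needs both JIT and PCRE2_MATCH_INVALID_UTF to be in effect.  */
static bool
exposed_to_bug (struct compile_setup const *s)
{
  return s->jit_enabled && uses_match_invalid_utf (s);
}
```

Turning off either input makes exposed_to_bug return false, which is why
dropping the flag and disabling JIT are both possible workarounds.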

> Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should
> focus our optimization efforts on future PCRE2 versions, and not worry about
> optimizing earlier versions where optimizations complicate maintenance for a
> declining benefit, and are likely to provoke bugs in older versions that as
> time passes will be harder to debug.

Not sure I understand your concern here, but if it is about disabling JIT
instead, then the possibility of introducing bugs is even bigger since it
affects all versions of PCRE2 (not only 10.34 or newer).

> > Alternatively JIT could be disabled instead, but the option selected has
> > less of an impact on performance.
> 
> Disabling JIT sounds better, as correctness trumps performance. Until the
> bug is fixed (or at least better-understood so that we have a workaround we
> can trust), how about the attached patch instead?

The bug has already been fixed, and the fix will be included in the next
release. There might be additional changes, as spelled out in that
discussion, and indeed the change to the proposed solution proactively
helps with one of those.
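
A version gate of this kind can be sketched as a plain function. This is
only an illustration under the assumption that every release after 10.42
carries the fix; the function and constant names are hypothetical. The
attached patch tests PCRE2_MAJOR == 10 && PCRE2_MINOR > 42, whereas this
sketch also accepts a later major version, which is a further assumption:

```c
#include <stdbool.h>

/* Assumption: 10.42 is the last PCRE2 release that has the bug.  */
enum { LAST_AFFECTED_MAJOR = 10, LAST_AFFECTED_MINOR = 42 };

/* Return true if release MAJOR.MINOR is newer than the last affected
   one, so PCRE2_MATCH_INVALID_UTF could be used with JIT again.  */
static bool
pcre2_release_has_fix (int major, int minor)
{
  return (LAST_AFFECTED_MAJOR < major
          || (major == LAST_AFFECTED_MAJOR && LAST_AFFECTED_MINOR < minor));
}
```

In real code the same comparison would be done by the preprocessor on the
PCRE2_MAJOR and PCRE2_MINOR macros from pcre2.h, so that the flag is simply
not compiled in when building against an affected release.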

It is very unlikely, but some systems might include nonzero values in the
tables for characters over 127, and that might trigger a similar problem that
is yet to be fixed.

Carlo

[1] https://github.com/PCRE2Project/pcre2/commit/2c08b619dc973beacc474dcb67cda8cd366200ce

--kp62zerfaxgdsxut
Content-Type: text/x-patch; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

From 919d4aa016dd979a52b9e5fd3b0ba1d1cf833ac8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@HIDDEN>
Date: Thu, 20 Apr 2023 18:37:20 -0700
Subject: [PATCH v2] pcre: workaround bug affecting PCRE2_MATCH_INVALID_UTF

PCRE2 has a bug when using PCRE2_MATCH_INVALID_UTF that would
randomly fail to match patterns using perl negative classes
(like \W or \D).

* NEWS: mention this
* src/pcresearch.c: restrict impact of the bug
do not use the problematic flag in all broken versions of PCRE2
only generate locale tables for non Unicode
* tests: add new pcre-utf8-bug224 test with replications for \W and \D
---
 NEWS                   |  5 +++++
 src/pcresearch.c       | 22 ++++++++++++++--------
 tests/Makefile.am      |  1 +
 tests/pcre-utf8-bug224 | 31 +++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 8 deletions(-)
 create mode 100755 tests/pcre-utf8-bug224

diff --git a/NEWS b/NEWS
index f16c576..3552db1 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,11 @@ GNU grep NEWS                                    -*- outline -*-
   when running on 32-bit x86 and ARM hosts using glibc 2.34+.
   [bug introduced in grep 3.9]
 
+  grep no longer fails to match patterns which relied on negative perl
+  classes like \D or \W when linked with PCRE2 10.34 or newer.
+  [bug introduced in grep 3.8]
+
+
 ** Changes in behavior
 
   grep --version now prints a line describing the version of PCRE2 it uses.
diff --git a/src/pcresearch.c b/src/pcresearch.c
index e867f49..a64b65b 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -58,6 +58,9 @@ struct pcre_comp
   /* Table, indexed by ! (flag & PCRE2_NOTBOL), of whether the empty
      string matches when that flag is used.  */
   int empty_match[2];
+
+  /* Flags */
+  unsigned binary_safe:1;
 };
 
 /* Memory allocation functions for PCRE.  */
@@ -130,16 +133,11 @@ jit_exec (struct pcre_comp *pc, char const *subject, idx_t search_bytes,
     }
 }
 
-/* Return true if E is an error code for bad UTF-8, and if pcre2_match
-   could return E because PCRE lacks PCRE2_MATCH_INVALID_UTF.  */
+/* Return true if E is an error code for bad UTF-8 */
 static bool
 bad_utf8_from_pcre2 (int e)
 {
-#ifdef PCRE2_MATCH_INVALID_UTF
-  return false;
-#else
   return PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1;
-#endif
 }
 
 /* Compile the -P style PATTERN, containing SIZE bytes that are
@@ -157,6 +155,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
     = pcre2_general_context_create (private_malloc, private_free, NULL);
   pcre2_compile_context *ccontext = pcre2_compile_context_create (gcontext);
 
+  pc->binary_safe = false;
   if (localeinfo.multibyte)
     {
       uint32_t unicode;
@@ -181,8 +180,13 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
       flags |= PCRE2_NEVER_BACKSLASH_C;
 #endif
 #ifdef PCRE2_MATCH_INVALID_UTF
+      /* workaround PCRE2 bug
+         https://github.com/PCRE2Project/pcre2/issues/224 */
+#if PCRE2_MAJOR == 10 && PCRE2_MINOR > 42
+      pc->binary_safe = true;
       /* Consider invalid UTF-8 as a barrier, instead of error.  */
       flags |= PCRE2_MATCH_INVALID_UTF;
+#endif
 #endif
     }
 
@@ -226,7 +230,9 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
       size = re_size;
     }
 
-  pcre2_set_character_tables (ccontext, pcre2_maketables (gcontext));
+  if (!localeinfo.multibyte)
+    pcre2_set_character_tables (ccontext, pcre2_maketables (gcontext));
+
   pc->cre = pcre2_compile ((PCRE2_SPTR) pattern, size, flags,
                            &ec, &e, ccontext);
   if (!pc->cre)
@@ -313,7 +319,7 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
 
           e = jit_exec (pc, subject, line_end - subject,
                         search_offset, options);
-          if (!bad_utf8_from_pcre2 (e))
+          if (pc->binary_safe || !bad_utf8_from_pcre2 (e))
             break;
 
           idx_t valid_bytes = pcre2_get_startchar (pc->data);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 7718f24..9b4422e 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -155,6 +155,7 @@ TESTS =						\
   pcre-jitstack					\
   pcre-o					\
   pcre-utf8					\
+  pcre-utf8-bug224				\
   pcre-utf8-w					\
   pcre-w					\
   pcre-wx-backref				\
diff --git a/tests/pcre-utf8-bug224 b/tests/pcre-utf8-bug224
new file mode 100755
index 0000000..549cc43
--- /dev/null
+++ b/tests/pcre-utf8-bug224
@@ -0,0 +1,31 @@
+#!/bin/sh
+# Ensure negative perl classes match multibyte characters in UTF mode
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+echo . | grep -qP '(*UTF).' 2>/dev/null \
+  || skip_ 'PCRE unicode support is compiled out'
+
+fail=0
+
+# 'ñ' (U+00F1)
+printf '\302\221\n' > in || framework_failure_
+grep -P '\D' in > out || fail=1
+compare in out || fail=1
+
+# “𝄞” (U+1D11E)
+printf '\360\235\204\236\n' > in || framework_failure_
+grep -P '\W' in > out || fail=1
+compare in out || fail=1
+
+Exit $fail
-- 
2.39.2 (Apple Git-143)
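
The test in the patch above writes its inputs as octal escapes. As a
cross-check, here is a minimal sketch of UTF-8 encoding in C (the helper
name utf8_encode is hypothetical and is not part of grep or PCRE2). With it,
U+1D11E encodes to the bytes \360\235\204\236 used in the second case, while
U+00F1 encodes to \303\261; the bytes \302\221 used in the first case are the
encoding of U+0091. Either character is a non-digit, so \D is expected to
match both.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode scalar value CP as UTF-8 into OUT.
   Return the number of bytes written, or 0 if CP is a surrogate
   or lies beyond U+10FFFF.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (0xD800 <= cp && cp <= 0xDFFF)
        return 0;               /* surrogates are not scalar values */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```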


--kp62zerfaxgdsxut--





Message received at 62983 <at> debbugs.gnu.org:


Received: (at 62983) by debbugs.gnu.org; 21 Apr 2023 18:43:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Apr 21 14:43:05 2023
Received: from localhost ([127.0.0.1]:41147 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ppviq-0000wG-IV
	for submit <at> debbugs.gnu.org; Fri, 21 Apr 2023 14:43:05 -0400
Received: from mail.cs.ucla.edu ([131.179.128.66]:39362)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1ppvij-0000vg-J8
 for 62983 <at> debbugs.gnu.org; Fri, 21 Apr 2023 14:43:03 -0400
Received: from localhost (localhost [127.0.0.1])
 by mail.cs.ucla.edu (Postfix) with ESMTP id 590AB3C097AFA;
 Fri, 21 Apr 2023 11:42:51 -0700 (PDT)
Received: from mail.cs.ucla.edu ([127.0.0.1])
 by localhost (mail.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id rO4KlsmuiOUB; Fri, 21 Apr 2023 11:42:50 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by mail.cs.ucla.edu (Postfix) with ESMTP id CB0A83C097AFD;
 Fri, 21 Apr 2023 11:42:50 -0700 (PDT)
DKIM-Filter: OpenDKIM Filter v2.10.3 mail.cs.ucla.edu CB0A83C097AFD
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cs.ucla.edu;
 s=9D0B346E-2AEB-11ED-9476-E14B719DCE6C; t=1682102570;
 bh=FCZiUNZfMbdhdOaA0RTrGf9wLYmpnkxkERYdXuqlmKI=;
 h=Message-ID:Date:MIME-Version:To:From;
 b=np2GLh7NqiBEyEW7N1Vo//fUO5eA4NWslkiVgqiki0czH6S81CcoKShJ7QbDJkQJ9
 xIX4BOCHHP5YVAuJn0pkjio4+QBWU88FnBiKI8fj3ZgCxEXg9r3QULtbgUYRUR/V7B
 X2ZZYT/p837CeMCQXZsBgj8WvfYWytRKLb1IzjgBL+yUOaVwARkKO3r5yQFB4ltuRT
 xjdZfujCOclK/+ph/6M/vcB6sdjZOd0KJSNMluQSlCTa4LWgJWGm9BnUZp05G63wOX
 nY3gRyJ2sd1UGeBy4eiK02LC7XV0ieIHKNvXSJ08Utqf9a3092ZEtWjyoMPqOhZ0rO
 dBRVtcr/NI/OQ==
X-Virus-Scanned: amavisd-new at mail.cs.ucla.edu
Received: from mail.cs.ucla.edu ([127.0.0.1])
 by localhost (mail.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id bCSycuizZS6N; Fri, 21 Apr 2023 11:42:50 -0700 (PDT)
Received: from [192.168.1.9] (cpe-172-91-119-151.socal.res.rr.com
 [172.91.119.151])
 by mail.cs.ucla.edu (Postfix) with ESMTPSA id A764E3C097AFA;
 Fri, 21 Apr 2023 11:42:50 -0700 (PDT)
Content-Type: multipart/mixed; boundary="------------LRtUWsM1TxZJ3GWjVmeLEwal"
Message-ID: <c82d3567-5dc9-ec84-f656-90e480bd3987@HIDDEN>
Date: Fri, 21 Apr 2023 11:42:50 -0700
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.10.0
Content-Language: en-US
To: =?UTF-8?Q?Carlo_Marcelo_Arenas_Bel=c3=b3n?= <carenas@HIDDEN>
References: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
In-Reply-To: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
X-Spam-Score: -1.1 (-)
X-Debbugs-Envelope-To: 62983
Cc: 62983 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -2.1 (--)

This is a multi-part message in MIME format.
--------------LRtUWsM1TxZJ3GWjVmeLEwal
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF have a bug in
> their JIT implementation that results in a failure to match the negative
> Perl classes; it seems to be easier to replicate when the matching
> character is a multibyte one.

Unfortunately that is a little vague. I expect the issue is not limited
to \D and \W, as there are other ways to specify negative Perl classes.
And if the bug merely seems to be easier to replicate with multibyte
characters, it sounds like we may have issues even when matching ASCII
characters in a UTF-8 locale.
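
To illustrate the point about other spellings, here is a minimal shell
sketch (not part of any proposed patch) that probes several negated
classes against a single multibyte character. The function names are
hypothetical, and the sketch assumes a grep built with -P support and an
installed en_US.UTF-8 locale, as the test in the original report does.

```shell
#!/bin/sh
# Hypothetical helper: print candidate PCRE patterns, one per line.
# Each should match U+1D11E, which is neither a digit, a word
# character, nor white space.
negated_class_patterns() {
  printf '%s\n' '\D' '\W' '\S' '\H' '\V' \
    '[^\d]' '[^\w]' '\P{Nd}' '[[:^digit:]]'
}

# Hypothetical helper: run every pattern against one line holding
# U+1D11E (UTF-8 bytes F0 9D 84 9E), using the grep named in $1
# (default: grep).  "NO MATCH" is also printed when grep itself fails,
# for example because it lacks -P support.
probe_negated_classes() {
  grep_cmd=${1:-grep}
  negated_class_patterns | while IFS= read -r pat; do
    if printf '\360\235\204\236\n' \
       | LC_ALL=en_US.UTF-8 "$grep_cmd" -qP -e "$pat" 2>/dev/null
    then
      printf '%s\tmatch\n' "$pat"
    else
      printf '%s\tNO MATCH\n' "$pat"
    fi
  done
}
```

Running "probe_negated_classes ./src/grep" lists one line per spelling;
on a build that is not affected, every line should read "match".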

Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We
should focus our optimization efforts on future PCRE2 versions, and not
worry about optimizing earlier versions where optimizations complicate
maintenance for a declining benefit, and are likely to provoke bugs in
older versions that as time passes will be harder to debug.


> Alternatively, JIT could be disabled, but the selected option has less
> of an impact on performance.

Disabling JIT sounds better, as correctness trumps performance. Until
the bug is fixed (or at least better-understood so that we have a
workaround we can trust), how about the attached patch instead?
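
For anyone who wants to compare JIT and non-JIT results on an installed
grep without rebuilding it, here is a rough sketch. It relies on the
(*NO_JIT) prefix that PCRE2 accepts at the start of a pattern, and it
assumes grep -P hands the pattern to PCRE2 unchanged (no -w or -x). The
function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prefix a PCRE pattern with (*NO_JIT) so that
# PCRE2 does not use JIT for it.
no_jit_pattern() {
  printf '%s\n' "(*NO_JIT)$1"
}

# Hypothetical helper: succeed when "grep -cP" counts the same number of
# matching lines in file $2 for pattern $1 with and without JIT.
# $3 names the grep to run (default: grep).  If grep lacks -P, both
# counts are empty and the comparison is meaningless.
jit_agrees() {
  pat=$1
  file=$2
  grep_cmd=${3:-grep}
  with_jit=$("$grep_cmd" -cP -e "$pat" "$file" 2>/dev/null)
  without_jit=$("$grep_cmd" -cP -e "$(no_jit_pattern "$pat")" "$file" 2>/dev/null)
  [ "$with_jit" = "$without_jit" ]
}
```

For example, "jit_agrees '\W' in || echo 'JIT and interpreter disagree'"
would flag an input file named "in" on which the two code paths differ.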

--------------LRtUWsM1TxZJ3GWjVmeLEwal
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-grep-use-PCRE2-JIT-only-in-unibyte-locales.patch"
Content-Disposition: attachment;
 filename="0001-grep-use-PCRE2-JIT-only-in-unibyte-locales.patch"
Content-Transfer-Encoding: base64

RnJvbSA0ZWM3MWI2M2Y5YWMwYmIyN2I2MGUxYzk4MDJlZGNiYTg2ODA5OWU4IE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBGcmksIDIxIEFwciAyMDIzIDExOjMxOjEyIC0wNzAwClN1YmplY3Q6IFtQQVRD
SF0gZ3JlcDogdXNlIFBDUkUyIEpJVCBvbmx5IGluIHVuaWJ5dGUgbG9jYWxlcwoKKiBzcmMv
cGNyZXNlYXJjaC5jIChQY29tcGlsZSk6IENhbGwgcGNyZTJfaml0X2NvbXBpbGUgb25seQpp
ZiBpbiBhIG11bHRpYnl0ZSBsb2NhbGUsIHRvIHdvcmsgYXJvdW5kIGEgUENSRTIgSklUIGJ1
Zy4KLS0tCiBORVdTICAgICAgICAgICAgIHwgIDQgKysrKwogc3JjL3BjcmVzZWFyY2guYyB8
IDE3ICsrKysrKysrKysrLS0tLS0tCiAyIGZpbGVzIGNoYW5nZWQsIDE1IGluc2VydGlvbnMo
KyksIDYgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXggZjE2
YzU3Ni4uYjliOGNkYSAxMDA2NDQKLS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC0xMSw2ICsx
MSwxMCBAQCBHTlUgZ3JlcCBORVdTICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgLSotIG91dGxpbmUgLSotCiAgIFVuaWNvZGUgaW50ZXJwcmV0YXRpb25zLgogICBbYnVn
IGludHJvZHVjZWQgaW4gZ3JlcCAzLjEwXQogCisgIFdpdGggLVAsIHBhdHRlcm5zIGxpa2Ug
XEQgYW5kIFxXIG5vdyB3b3JrIGFnYWluIGluIGEgVVRGLTggbG9jYWxlLAorICB3aGVuIGxp
bmtlZCB0byBQQ1JFMiAxMC4zNCBvciBuZXdlci4KKyAgW2J1ZyBpbnRyb2R1Y2VkIGluIGdy
ZXAgMy44XQorCiAgIGdyZXAgbm8gbG9uZ2VyIGZhaWxzIG9uIGZpbGVzIGRhdGVkIGFmdGVy
IHRoZSB5ZWFyIDIwMzgsCiAgIHdoZW4gcnVubmluZyBvbiAzMi1iaXQgeDg2IGFuZCBBUk0g
aG9zdHMgdXNpbmcgZ2xpYmMgMi4zNCsuCiAgIFtidWcgaW50cm9kdWNlZCBpbiBncmVwIDMu
OV0KZGlmZiAtLWdpdCBhL3NyYy9wY3Jlc2VhcmNoLmMgYi9zcmMvcGNyZXNlYXJjaC5jCmlu
ZGV4IGU4MmJmODYuLjQwODZiYmMgMTAwNjQ0Ci0tLSBhL3NyYy9wY3Jlc2VhcmNoLmMKKysr
IGIvc3JjL3BjcmVzZWFyY2guYwpAQCAtMjQzLDEzICsyNDMsMTggQEAgUGNvbXBpbGUgKGNo
YXIgKnBhdHRlcm4sIGlkeF90IHNpemUsIHJlZ19zeW50YXhfdCBpZ25vcmVkLCBib29sIGV4
YWN0KQogICBwYy0+bWNvbnRleHQgPSBOVUxMOwogICBwYy0+ZGF0YSA9IHBjcmUyX21hdGNo
X2RhdGFfY3JlYXRlX2Zyb21fcGF0dGVybiAocGMtPmNyZSwgZ2NvbnRleHQpOwogCi0gIC8q
IElnbm9yZSBhbnkgZmFpbHVyZSByZXR1cm4gZnJvbSBwY3JlMl9qaXRfY29tcGlsZSwgYXMg
dGhhdCBtZXJlbHkKLSAgICAgbWVhbnMgSklUIHdvbid0IGJlIHVzZWQgZHVyaW5nIG1hdGNo
aW5nLiAgKi8KLSAgcGNyZTJfaml0X2NvbXBpbGUgKHBjLT5jcmUsIFBDUkUyX0pJVF9DT01Q
TEVURSk7CisgIC8qIERvIG5vdCB1c2UgUENSRTIgSklUIGluIG11bHRpYnl0ZSBsb2NhbGVz
IDxodHRwczovL2J1Z3MuZ251Lm9yZy82Mjk4Mz4uCisgICAgIEZJWE1FOiB3aGVuIHRoZSBQ
Q1JFMiBidWcgaXMgZml4ZWQgb3IgYSByZWxpYWJsZSB3b3JrYXJvdW5kIGZvdW5kLiAgKi8K
KyAgaWYgKCFsb2NhbGVpbmZvLm11bHRpYnl0ZSkKKyAgICB7CisgICAgICAvKiBJZ25vcmUg
YW55IGZhaWx1cmUgcmV0dXJuIGZyb20gcGNyZTJfaml0X2NvbXBpbGUsIGFzIHRoYXQgbWVy
ZWx5CisgICAgICAgICBtZWFucyBKSVQgd29uJ3QgYmUgdXNlZCBkdXJpbmcgbWF0Y2hpbmcu
ICAqLworICAgICAgcGNyZTJfaml0X2NvbXBpbGUgKHBjLT5jcmUsIFBDUkUyX0pJVF9DT01Q
TEVURSk7CiAKLSAgLyogVGhlIFBDUkUgZG9jdW1lbnRhdGlvbiBzYXlzIHRoYXQgYSAzMiBL
aUIgc3RhY2sgaXMgdGhlIGRlZmF1bHQuICAqLwotICBwYy0+aml0X3N0YWNrID0gTlVMTDsK
LSAgcGMtPmppdF9zdGFja19zaXplID0gMzIgPDwgMTA7CisgICAgICAvKiBUaGUgUENSRSBk
b2N1bWVudGF0aW9uIHNheXMgdGhhdCBhIDMyIEtpQiBzdGFjayBpcyB0aGUgZGVmYXVsdC4g
ICovCisgICAgICBwYy0+aml0X3N0YWNrID0gTlVMTDsKKyAgICAgIHBjLT5qaXRfc3RhY2tf
c2l6ZSA9IDMyIDw8IDEwOworICAgIH0KIAogICBwYy0+ZW1wdHlfbWF0Y2hbZmFsc2VdID0g
cGNyZV9leGVjIChwYywgIiIsIDAsIDAsIFBDUkUyX05PVEJPTCk7CiAgIHBjLT5lbXB0eV9t
YXRjaFt0cnVlXSA9IHBjcmVfZXhlYyAocGMsICIiLCAwLCAwLCAwKTsKLS0gCjIuMzkuMgoK


--------------LRtUWsM1TxZJ3GWjVmeLEwal--




Information forwarded to bug-grep@HIDDEN:
bug#62983; Package grep. Full text available.

Message received at 62983 <at> debbugs.gnu.org:


Received: (at 62983) by debbugs.gnu.org; 21 Apr 2023 02:35:26 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 20 22:35:26 2023
Received: from localhost ([127.0.0.1]:38983 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ppgcQ-0007X1-1E
	for submit <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:35:26 -0400
Received: from mail-lj1-f178.google.com ([209.85.208.178]:54566)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@HIDDEN>) id 1ppgcN-0007Wl-W9
 for 62983 <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:35:24 -0400
Received: by mail-lj1-f178.google.com with SMTP id
 38308e7fff4ca-2a8db10a5d4so11144561fa.1
 for <62983 <at> debbugs.gnu.org>; Thu, 20 Apr 2023 19:35:23 -0700 (PDT)
MIME-Version: 1.0
References: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
 <CA+8g5KF2Sr5RaeQJihvQqgZGVnYUbsfATBfK6FMRed9tyn=9RA@HIDDEN>
In-Reply-To: <CA+8g5KF2Sr5RaeQJihvQqgZGVnYUbsfATBfK6FMRed9tyn=9RA@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Thu, 20 Apr 2023 19:35:05 -0700
Message-ID: <CA+8g5KHuAJw=p74wtKuEhwb=k9761qatotE4Evgj=aACHz675A@HIDDEN>
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
To: =?UTF-8?Q?Carlo_Marcelo_Arenas_Bel=C3=B3n?= <carenas@HIDDEN>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.2 (/)
X-Debbugs-Envelope-To: 62983
Cc: 62983 <at> debbugs.gnu.org
X-Spam-Score: -0.8 (/)

On Thu, Apr 20, 2023 at 7:33 PM Jim Meyering <jim@HIDDEN> wrote:
>
> On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
> <carenas@HIDDEN> wrote:
> > All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF have a bug in
> > their JIT implementation that results in a failure to match the negative
> > Perl classes; it seems to be easier to replicate when the matching
> > character is a multibyte one.
> >
> > Disable that flag and use the original fallback instead.
> >
> > Alternatively, JIT could be disabled, but the selected option has less
> > of an impact on performance.
>
> Thanks for the patch! Is there any PCRE-upstream discussion about this?
> If so, I'd like to reference that from your commit log.

Oh! I see it in the test file:
  https://github.com/PCRE2Project/pcre2/issues/224





Message received at 62983 <at> debbugs.gnu.org:


Received: (at 62983) by debbugs.gnu.org; 21 Apr 2023 02:33:46 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 20 22:33:46 2023
Received: from localhost ([127.0.0.1]:38973 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ppgao-0007U6-Fi
	for submit <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:33:46 -0400
Received: from mail-lj1-f169.google.com ([209.85.208.169]:50269)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@HIDDEN>) id 1ppgam-0007Ts-Fg
 for 62983 <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:33:45 -0400
Received: by mail-lj1-f169.google.com with SMTP id
 38308e7fff4ca-2a7af0cb2e6so10780931fa.0
 for <62983 <at> debbugs.gnu.org>; Thu, 20 Apr 2023 19:33:44 -0700 (PDT)
MIME-Version: 1.0
References: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
In-Reply-To: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
From: Jim Meyering <jim@HIDDEN>
Date: Thu, 20 Apr 2023 19:33:25 -0700
Message-ID: <CA+8g5KF2Sr5RaeQJihvQqgZGVnYUbsfATBfK6FMRed9tyn=9RA@HIDDEN>
Subject: Re: bug#62983: workaround PCRE2 bug affecting at least \D and \W
To: =?UTF-8?Q?Carlo_Marcelo_Arenas_Bel=C3=B3n?= <carenas@HIDDEN>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.2 (/)
X-Debbugs-Envelope-To: 62983
Cc: 62983 <at> debbugs.gnu.org
X-Spam-Score: -0.8 (/)

On Thu, Apr 20, 2023 at 7:05 PM Carlo Marcelo Arenas Belón
<carenas@HIDDEN> wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF have a bug in
> their JIT implementation that results in a failure to match the negative
> Perl classes; it seems to be easier to replicate when the matching
> character is a multibyte one.
>
> Disable that flag and use the original fallback instead.
>
> Alternatively, JIT could be disabled, but the selected option has less
> of an impact on performance.

Thanks for the patch! Is there any PCRE-upstream discussion about this?
If so, I'd like to reference that from your commit log.





Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 21 Apr 2023 02:04:28 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Apr 20 22:04:28 2023
Received: from localhost ([127.0.0.1]:38963 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ppg8R-0006oV-CY
	for submit <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:04:27 -0400
Received: from lists.gnu.org ([209.51.188.17]:39918)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <carenas@HIDDEN>) id 1ppg8P-0006oO-Uz
 for submit <at> debbugs.gnu.org; Thu, 20 Apr 2023 22:04:26 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <carenas@HIDDEN>) id 1ppg8O-0000c8-Uw
 for bug-grep@HIDDEN; Thu, 20 Apr 2023 22:04:24 -0400
Received: from mail-pl1-x62b.google.com ([2607:f8b0:4864:20::62b])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <carenas@HIDDEN>) id 1ppg8M-0004Gj-VD
 for bug-grep@HIDDEN; Thu, 20 Apr 2023 22:04:24 -0400
Received: by mail-pl1-x62b.google.com with SMTP id
 d9443c01a7336-1a814fe0ddeso19107875ad.2
 for <bug-grep@HIDDEN>; Thu, 20 Apr 2023 19:04:22 -0700 (PDT)
Received: from Carlos-MacBook-Pro-2.local
 (192-184-219-167.fiber.dynamic.sonic.net. [192.184.219.167])
 by smtp.gmail.com with ESMTPSA id
 c23-20020a170902849700b001a05122b562sm1683684plo.286.2023.04.20.19.04.18
 for <bug-grep@HIDDEN>
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 20 Apr 2023 19:04:19 -0700 (PDT)
Date: Thu, 20 Apr 2023 19:04:18 -0700
From: Carlo Marcelo Arenas =?utf-8?B?QmVsw7Nu?= <carenas@HIDDEN>
To: bug-grep@HIDDEN
Subject: workaround PCRE2 bug affecting at least \D and \W
Message-ID: <mseeglsi46hm3qor5pdj6xkejip7lgyqpvata65cakztcgwgoq@hsrhke2bfjgd>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="h7otyrncashcyzty"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::62b;
 envelope-from=carenas@HIDDEN; helo=mail-pl1-x62b.google.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
 T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Score: -1.3 (-)
X-Debbugs-Envelope-To: submit
X-Spam-Score: -2.3 (--)


--h7otyrncashcyzty
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF have a bug in
their JIT implementation that results in a failure to match the negative
Perl classes; it seems to be easier to replicate when the matching
character is a multibyte one.

Disable that flag and use the original fallback instead.

Alternatively, JIT could be disabled, but the selected option has less
of an impact on performance.

Carlo
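
For readers skimming the attachment: the patch keeps
PCRE2_MATCH_INVALID_UTF only when the pattern text contains neither \D
nor \W, using a plain substring search (the strstr calls in the C diff).
A rough shell equivalent of that check is sketched here to show what a
substring search does and does not catch. The function name is
hypothetical and this is not the patch's code.

```shell
#!/bin/sh
# Hypothetical helper: succeed when pattern $1 contains the two-byte
# substring \D or \W anywhere, mirroring a plain substring search.
mentions_D_or_W() {
  case $1 in
    *'\D'* | *'\W'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print how the substring check classifies a few sample patterns.
for pat in 'a\Db' '\d+' '\\D' '[^\d]' '\S'; do
  if mentions_D_or_W "$pat"; then
    printf '%s\tflag dropped\n' "$pat"
  else
    printf '%s\tflag kept\n' "$pat"
  fi
done
```

Two of the samples show the limits of a substring search: '\\D' (an
escaped backslash followed by a literal D) is treated as if it used \D,
while '[^\d]' and '\S' are negated classes that the search does not see.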

--h7otyrncashcyzty
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment;
	filename="0001-pcre-workaround-bug-affecting-W-or-D.patch"
Content-Transfer-Encoding: 8bit

From 9194c8e9f9ca7315c2e8c25a7986d0690fb31d7c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@HIDDEN>
Date: Thu, 20 Apr 2023 18:37:20 -0700
Subject: [PATCH] pcre: workaround bug affecting \W or \D

PCRE2 has a bug when PCRE2_MATCH_INVALID_UTF is used that randomly
causes patterns using \W or \D to fail to match.

* NEWS: Mention this.
* src/pcresearch.c: Do not use the problematic flag with any broken
  version of PCRE2.
* tests: Add new pcre-utf8-bug224 test.
---
 NEWS                   |  5 +++++
 src/pcresearch.c       | 23 ++++++++++++++---------
 tests/Makefile.am      |  1 +
 tests/pcre-utf8-bug224 | 31 +++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 9 deletions(-)
 create mode 100755 tests/pcre-utf8-bug224

diff --git a/NEWS b/NEWS
index f16c576..8e371dc 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,11 @@ GNU grep NEWS                                    -*- outline -*-
   when running on 32-bit x86 and ARM hosts using glibc 2.34+.
   [bug introduced in grep 3.9]
 
+  grep no longer fails to match patterns with \D or \W when linked to
+  PCRE2 10.34 or newer.
+  [bug introduced in grep 3.8]
+
+
 ** Changes in behavior
 
   grep --version now prints a line describing the version of PCRE2 it uses.
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 1f82932..6ef0d2e 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -58,6 +58,9 @@ struct pcre_comp
   /* Table, indexed by ! (flag & PCRE2_NOTBOL), of whether the empty
      string matches when that flag is used.  */
   int empty_match[2];
+
+  /* Flags */
+  unsigned binary_safe:1;
 };
 
 /* Memory allocation functions for PCRE.  */
@@ -130,16 +133,11 @@ jit_exec (struct pcre_comp *pc, char const *subject, idx_t search_bytes,
     }
 }
 
-/* Return true if E is an error code for bad UTF-8, and if pcre2_match
-   could return E because PCRE lacks PCRE2_MATCH_INVALID_UTF.  */
+/* Return true if E is an error code for bad UTF-8.  */
 static bool
 bad_utf8_from_pcre2 (int e)
 {
-#ifdef PCRE2_MATCH_INVALID_UTF
-  return false;
-#else
   return PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1;
-#endif
 }
 
 /* Compile the -P style PATTERN, containing SIZE bytes that are
@@ -157,6 +155,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
     = pcre2_general_context_create (private_malloc, private_free, NULL);
   pcre2_compile_context *ccontext = pcre2_compile_context_create (gcontext);
 
+  pc->binary_safe = false;
   if (localeinfo.multibyte)
     {
       uint32_t unicode;
@@ -181,8 +180,14 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
       flags |= PCRE2_NEVER_BACKSLASH_C;
 #endif
 #ifdef PCRE2_MATCH_INVALID_UTF
-      /* Consider invalid UTF-8 as a barrier, instead of error.  */
-      flags |= PCRE2_MATCH_INVALID_UTF;
+      /* Work around PCRE2 bug
+         https://github.com/PCRE2Project/pcre2/issues/224 */
+#if PCRE2_MAJOR == 10 && PCRE2_MINOR <= 42
+      pc->binary_safe = !strstr (pattern, "\\D") && !strstr (pattern, "\\W");
+      if (pc->binary_safe)
+        /* Consider invalid UTF-8 as a barrier, instead of error.  */
+        flags |= PCRE2_MATCH_INVALID_UTF;
+#endif
 #endif
     }
 
@@ -313,7 +318,7 @@ Pexecute (void *vcp, char const *buf, idx_t size, idx_t *match_size,
 
           e = jit_exec (pc, subject, line_end - subject,
                         search_offset, options);
-          if (!bad_utf8_from_pcre2 (e))
+          if (pc->binary_safe || !bad_utf8_from_pcre2 (e))
             break;
 
           idx_t valid_bytes = pcre2_get_startchar (pc->data);
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 7718f24..9b4422e 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -155,6 +155,7 @@ TESTS =						\
   pcre-jitstack					\
   pcre-o					\
   pcre-utf8					\
+  pcre-utf8-bug224				\
   pcre-utf8-w					\
   pcre-w					\
   pcre-wx-backref				\
diff --git a/tests/pcre-utf8-bug224 b/tests/pcre-utf8-bug224
new file mode 100755
index 0000000..739e7b5
--- /dev/null
+++ b/tests/pcre-utf8-bug224
@@ -0,0 +1,31 @@
+#!/bin/sh
+# Ensure \D and \W match multibyte characters in UTF mode
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+echo . | grep -qP '(*UTF).' 2>/dev/null \
+  || skip_ 'PCRE unicode support is compiled out'
+
+fail=0
+
+# 'ñ' (U+00F1)
+printf '\303\261\n' > in || framework_failure_
+grep -P '\D' in > out || fail=1
+compare in out || fail=1
+
+# “𝄞” (U+1D11E)
+printf '\360\235\204\236\n' > in || framework_failure_
+grep -P '\W' in > out || fail=1
+compare in out || fail=1
+
+Exit $fail
-- 
2.39.2 (Apple Git-143)


--h7otyrncashcyzty--
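
The printf arguments in the test script above are UTF-8 bytes written as
octal escapes. As a cross-check, here is a small sketch that derives such
escapes from a code point using POSIX shell arithmetic. The function name
is hypothetical, and it does not validate its input (surrogates and
values above U+10FFFF are not rejected).

```shell
#!/bin/sh
# Hypothetical helper: print the UTF-8 encoding of code point $1
# (decimal or 0x-prefixed hex) as printf-style octal escapes.
utf8_octal() {
  cp=$(($1))
  if [ "$cp" -lt 128 ]; then
    # 1 byte: 0xxxxxxx
    printf '\\%03o\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '\\%03o\\%03o\n' \
      $((192 | ($cp >> 6))) $((128 | ($cp & 63)))
  elif [ "$cp" -lt 65536 ]; then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\%03o\\%03o\\%03o\n' \
      $((224 | ($cp >> 12))) $((128 | (($cp >> 6) & 63))) \
      $((128 | ($cp & 63)))
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '\\%03o\\%03o\\%03o\\%03o\n' \
      $((240 | ($cp >> 18))) $((128 | (($cp >> 12) & 63))) \
      $((128 | (($cp >> 6) & 63))) $((128 | ($cp & 63)))
  fi
}
```

With this sketch, "utf8_octal 0xF1" prints \303\261 and
"utf8_octal 0x1D11E" prints \360\235\204\236.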




Acknowledgement sent to Carlo Marcelo Arenas Belón <carenas@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#62983; Package grep. Full text available.
Last modified: Sat, 29 Apr 2023 07:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.