From: Paul Eggert <eggert@HIDDEN>
To: Carlo Arenas <carenas@HIDDEN>
Cc: demerphq <demerphq@HIDDEN>, pcre2-dev@HIDDEN, 60690 <at> debbugs.gnu.org,
 mega lith01 <megalith01@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>,
 Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Date: Sat, 8 Apr 2023 15:45:20 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep

On 2023-04-07 22:01, Carlo Arenas wrote:

> Not sure I follow the whole logic here, but PCRE2[3] (search for
> "general category" which is what the "gc" above stands for) only
> supports the abbreviated form of the unicode classes and `Nd` is
> indeed the one that corresponds to `Decimal_Number`.

That's fine: all that UTS#18[1] requires is that PCRE2 provide syntax for a regular expression that matches the Decimal Number class. Which PCRE2 does, via \p{Nd}.

The logic is that UTS#18 does not require that \d must behave like \p{Nd}, or even that \p{gc=Decimal_Number} must behave like \p{Nd}. It merely requires that there be some syntax for matching Decimal Number, and it says the choice of syntax is up to the implementer. This is why UTS#18 doesn't require that \d must also match non-ASCII digits (which is what I think Yves was saying).

[1]: https://unicode.org/reports/tr18/
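To make the distinction above concrete (the Decimal Number property on one hand, and whatever an engine decides \d should mean on the other), here is a minimal Python sketch. Python's `re` module is only a convenient stand-in; it is not one of the engines discussed in this thread. The sample characters and the helper names are chosen purely for illustration.

```python
import re
import unicodedata

# Illustrative samples: ASCII 7, ARABIC-INDIC DIGIT THREE,
# DEVANAGARI DIGIT ONE, and a plain letter.
samples = ["7", "\u0663", "\u0967", "x"]


def is_decimal_number(ch: str) -> bool:
    """True if the character has Unicode general category Nd."""
    return unicodedata.category(ch) == "Nd"


def unicode_d(ch: str) -> bool:
    """\\d as Python's re interprets it for str patterns (Unicode-aware)."""
    return re.fullmatch(r"\d", ch) is not None


def ascii_d(ch: str) -> bool:
    """\\d restricted to [0-9] by the re.ASCII flag."""
    return re.fullmatch(r"\d", ch, flags=re.ASCII) is not None


for ch in samples:
    print(f"U+{ord(ch):04X} Nd={is_decimal_number(ch)} "
          f"unicode_d={unicode_d(ch)} ascii_d={ascii_d(ch)}")
```

Both readings of \d coexist in this one library, selected by a flag, while the property test stays the same. That is the sense in which the meaning of \d is a choice made by the implementer rather than something fixed by the property itself.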
bug-grep@HIDDEN: bug#60690; Package grep.
From: Carlo Arenas <carenas@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Cc: demerphq <demerphq@HIDDEN>, pcre2-dev@HIDDEN, 60690 <at> debbugs.gnu.org,
 mega lith01 <megalith01@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>,
 Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Date: Fri, 7 Apr 2023 22:01:14 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep

On Fri, Apr 7, 2023 at 12:00 PM Paul Eggert <eggert@HIDDEN> wrote:
>
> On 2023-04-06 06:39, demerphq wrote:
>
> > Unicode specifies that \d match any digit
> > in any script that it supports.
>
> "Specifies" is too strong. The Unicode Regular Expressions technical
> standard (UTS#18) mentions \d only in Annex C[1], next to the word
> "digit" in a column labeled "Property" (even though \d is really syntax
> not a property). This is at best an informal recommendation, not a
> requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
> illustration and that although it's similar to Perl's, the two syntax
> forms may not be exactly the same. So we can't look to UTS#18 for a
> definitive way out of the \d mess, as the Unicode folks specifically
> delegated matters to us.
>
> Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C
> says "\p{gc=Decimal_Number}" is the standard recommended syntax
> assignment for digits. However, PCRE2 does not support this syntax; it
> supports another variant \p{Nd} that UTS#18 also recommends. So it
> appears that PCRE2 already does not implement every recommended aspect
> of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
> "\p{gc=Decimal_Number}".

Not sure I follow the whole logic here, but PCRE2[3] (search for "general category" which is what the "gc" above stands for) only supports the abbreviated form of the unicode classes and `Nd` is indeed the one that corresponds to `Decimal_Number`.

Carlo

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance
[3]: https://pcre2project.github.io/pcre2/doc/html/pcre2pattern.html
From: Paul Eggert <eggert@HIDDEN>
To: demerphq <demerphq@HIDDEN>
Cc: pcre2-dev@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>,
 Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>,
 Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Date: Fri, 7 Apr 2023 12:00:16 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep

On 2023-04-06 06:39, demerphq wrote:

> Unicode specifies that \d match any digit
> in any script that it supports.

"Specifies" is too strong. The Unicode Regular Expressions technical standard (UTS#18) mentions \d only in Annex C[1], next to the word "digit" in a column labeled "Property" (even though \d is really syntax not a property). This is at best an informal recommendation, not a requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for illustration and that although it's similar to Perl's, the two syntax forms may not be exactly the same. So we can't look to UTS#18 for a definitive way out of the \d mess, as the Unicode folks specifically delegated matters to us.

Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C says "\p{gc=Decimal_Number}" is the standard recommended syntax assignment for digits. However, PCRE2 does not support this syntax; it supports another variant \p{Nd} that UTS#18 also recommends. So it appears that PCRE2 already does not implement every recommended aspect of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support "\p{gc=Decimal_Number}".

Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, that's clearly enough for grep -P to conform to UTS#18 with respect to digits.

> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.

Here's how GNU grep does it:

* RE semantics are specified via command-line options like -P.
* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.
* REs do not match encoding errors.

> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)

Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle UTFE, UTF-16LE vs UTF-16BE etc. If you're running legacy IBM mainframe or MS-Windows code these legacy encodings are obviously a big deal. However, there seems little reason to force their nontrivial hassles onto every GNU/Linux program that processes text. A few specialized apps like 'iconv' deal with offbeat encodings, and that is probably a better approach all around.

> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.

That is a use case, yes. It's what 'strings' and 'grep' do.

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance
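The rule "REs do not match encoding errors" in the message above can be modelled loosely in Python. The sketch below is not how GNU grep is implemented. It assumes UTF-8 as the locale's encoding, and the function names are hypothetical. It searches only inside the stretches of a line that decode cleanly, so no match can contain or span an invalid byte.

```python
import re
from typing import Iterator


def valid_utf8_runs(data: bytes) -> Iterator[str]:
    """Yield the maximal runs of `data` that decode cleanly as UTF-8."""
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        try:
            text = chunk.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start / err.end are offsets into `chunk`.
            if err.start > 0:
                yield chunk[:err.start].decode("utf-8")
            pos += err.end  # skip past the invalid byte(s)
        else:
            yield text
            return


def search_skipping_errors(pattern: str, data: bytes) -> bool:
    """True if `pattern` matches somewhere within a valid run of `data`."""
    return any(re.search(pattern, run) for run in valid_utf8_runs(data))


line = b"abc\xff123"  # 0xFF can never appear in valid UTF-8
print(list(valid_utf8_runs(line)))
print(search_skipping_errors(r"\d+", line))
print(search_skipping_errors(r"c.1", line))
```

One known limitation of this simplification: anchors such as `^` and `$` would match at the edges of each run rather than only at the edges of the line, so it illustrates the idea only for unanchored patterns.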
From: Paul Eggert <eggert@HIDDEN>
To: demerphq <demerphq@HIDDEN>
Cc: Carlo Arenas <carenas@HIDDEN>, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>,
 Philip.Hazel@HIDDEN, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, git@HIDDEN,
 Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, pcre-dev@HIDDEN
Date: Fri, 7 Apr 2023 09:48:40 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep

On 2023-04-06 08:45, demerphq wrote:

>> Although this causes pcre2grep to mishandle Unicode characters:
>>
>>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>>    Ævar
>>
>> it mimics Perl 5.36:
>>
>>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>>    Ævar
>>
>> so this seems to be what Perl users expect, despite its infelicities.

> Actually no, I think you have misunderstood what is happening at the
> different layers involved here.

No, I understood what was going on. My point was that Perl users seem to have accepted this behavior, even though it does not match what people would ordinarily expect.

> What you should have done is something like this:

No, for two reasons. First, I'm no Perl expert and so I don't know (and don't particularly want to learn) its complicated Unicode options and calls. Second, /[Ss\x{DF}]/u is hard to read. If I want the S letters of traditional German, I'll write them in the obvious way, as [Ssß]. No doubt Perl will let me do this somehow - but it is telling that none of your examples do it in such a straightforward way.

> $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
> $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{DF}]/u'
> baß
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{C6}]/u'
> Ævar
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{e6}]/ui'
> Ævar
From: demerphq <demerphq@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Cc: Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>,
 Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>,
 Junio C Hamano <gitster@HIDDEN>, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Date: Thu, 6 Apr 2023 17:45:09 +0200
Subject: Re: bug#60690: -P '\d' in GNU and git grep

On Wed, 5 Apr 2023 at 20:32, Paul Eggert <eggert@HIDDEN> wrote:
>
> On 2023-04-04 12:31, Junio C Hamano wrote:
>
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.

Actually no, I think you have misunderstood what is happening at the different layers involved here.

Your terminal is rendering ß as a glyph. But it is almost certainly actually the octets C3 9F (which is the UTF8 canonical representation of the codepoint U+DF). So the code you provided to perl is close to the equivalent of

    echo 'Ævar' | perl -ne 'print $_ if /[Ss\x{C3}\x{9F}]/'

And if you check, you will see that U+C6 "Æ" in utf8 is represented as the octets C3 86. So what you have done is the equivalent of:

    perl -le'print "\x{C3}\x{86}"' | perl -ne'print $_ if /[Ss\x{C3}\x{9F}]/'

which of course matches. \x{C3} matches \x{C3} always and everywhere.

What you should have done is something like this:

    $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
    $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{DF}]/u'
    baß
    $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{C6}]/u'
    Ævar
    $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{e6}]/ui'
    Ævar

The "utf8::decode($_)" tells perl to decode the input string as though it contained utf8 (which in this case it does). The /u suffix tells the regex engine that you want Unicode semantics.

I believe that the same thing is true of your pcre2grep example. You simply aren't checking what you think you are checking. Your terminal renders UTF8 as glyphs, but the programs you are feeding those glyphs to aren't seeing glyphs, they are seeing UTF8 sequences as distinct octets, and are not decoding their input back as codepoints.

You could have checked your assumptions by using the -Mre=debug option to perl:

    $ echo 'Ævar' | perl -Mre=debug -ne 'print $_ if /[Ssß]/'
    Compiling REx "[Ss%x{c3}%x{9f}]"
    Final program:
       1: ANYOF[Ss\x9F\xC3] (11)
      11: END (0)
    stclass ANYOF[Ss\x9F\xC3] minlen 1
    Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"
    Matching stclass ANYOF[Ss\x9F\xC3] against "%x{c3}%x{86}var%n" (6 bytes)
       0 <> <%x{c3}> | 0| 1:ANYOF[Ss\x9F\xC3](11)
       1 <%x{c3}> <%x{86}var> | 0| 11:END(0)
    Match successful!
    Ævar
    Freeing REx: "[Ss%x{c3}%x{9f}]"

The line:

    Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"

basically says it all. Perl has not decoded the UTF8 into U+C6, and it has not decoded the UTF8 for U+DF either. Instead you have asked it if the UTF8 sequence that represents U+C6 contains any of the same octets as the UTF8 representation of U+53, U+73 and U+DF would. Which gives the common octet of \x{c3}.
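The octet-versus-codepoint effect described above is not specific to Perl or pcre2grep. As a minimal sketch, the same experiment can be repeated in Python, which is used here only as a neutral illustration with made-up variable names: once with the pattern and subject as raw UTF-8 bytes, once as decoded strings.

```python
import re

text = "Ævar"
pattern = "[Ssß]"

# Octet view: pattern and subject are both raw UTF-8 bytes, so the
# character class is really [S, s, 0xC3, 0x9F].
byte_pattern = pattern.encode("utf-8")   # b'[Ss\xc3\x9f]'
byte_text = text.encode("utf-8")         # b'\xc3\x86var'
byte_match = re.search(byte_pattern, byte_text)

# Code point view: the class is {S, s, U+00DF} and the subject starts
# with the single code point U+00C6.
char_match = re.search(pattern, text)

# The octets shared by the encodings of ß and Æ explain the byte match.
shared = set("ß".encode("utf-8")) & set("Æ".encode("utf-8"))

print(byte_match.group() if byte_match else None)
print(char_match)
print({hex(b) for b in shared})
```

At the byte level the class matches the leading 0xC3 of "Æ", while at the code point level nothing matches, which mirrors the difference between the undecoded and the utf8::decode'd Perl runs.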
> For better Unicode handling one can use pcre2grep's -u or --utf option, > which causes pcre2grep to behave more like GNU grep -P and git grep -P: > "echo '=C3=86var' | pcre2grep -u '[Ss=C3=9F]'" outputs nothing, which I t= hink is > what most people would expect (unless they're Perl users :-). It is what Perl users would expect also, assuming you actually wrote the character class [Ss\x{DF}] and asked for unicode semantics. \x{DF} is the Latin1 codepoint range, so perl will assume that you meant ASCII semantics unless you tell it otherwise. Basically these tests you have quoted here are just examples of garbage in garbage out. Perl has been working together with the Unicode consortium for over 20 years. Afaik we were and are the reference implementation for the spec on regular expression matching in Unicode and we have a long history of working together with the Unicode consortium to refine and implement the spec. You should assume that if Perl seems to have made a gross error in how it does Unicode matching that you are simply using it wrong, we take a great deal of pride in having the best Unicode support there is. https://unicode.org/reports/tr18/ FWIW, i think this email nicely illustrates the issues with git and regular expressions. To do regular expressions properly you need to know a) what semantics do you expect, b) how to decode the text you are matching against. If you want unicode semantics you need to have a way to ask for it. If you want to match against Unicode data then you need a way to determine which of the 6 possible encodings[1] of Unicode data you are using. If you get either wrong you will not get the results you expect. You may even want to deal with cases where you want Unicode semantics, but to match against non-unicode data. For instance Latin-1. In Latin-1 the codepoint U+DF is the *octet* 0xDF. Maybe you want that octet to match "ss" case-insensitively, as a German speaker would expect and as Unicode specifies is correct. 
Or vice versa, maybe you are like some of the posters to this thread
who seem to expect that \d should not match U+16B51 (which a Hmong
speaker might well expect it to match).

Perl resolves these problems at the pattern level by supporting the
suffixes /a and /u (for ASCII and Unicode), and at the string level it
supports two types of string: Unicode strings, and binary/ASCII
strings. By default input is the latter, but there are a variety of
ways of saying that a file handle should decode to Unicode instead.

cheers,
Yves

[1] UTF-EBCDIC, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
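The idea of choosing semantics at the pattern level has a rough analogue in other engines. As a hedged illustration only (Python's `re.ASCII` flag is not Perl's /a, and its default Unicode behaviour for `str` patterns is not Perl's /u), this sketch shows the same pattern giving different answers depending on which semantics are requested. The variable names are hypothetical.

```python
import re
import unicodedata

hmong_one = "\U00016B51"  # the code point U+16B51 mentioned above
print(unicodedata.category(hmong_one))  # general category of that code point

# str patterns use Unicode rules by default; re.ASCII restricts \d to [0-9].
unicode_digit = re.fullmatch(r"\d", hmong_one)
ascii_digit = re.fullmatch(r"\d", hmong_one, re.ASCII)

# Case-insensitive matching is affected in the same way:
# with Unicode rules U+00C6 folds to U+00E6, with re.ASCII it does not.
unicode_fold = re.search(r"[Ssæ]", "Ævar", re.IGNORECASE)
ascii_fold = re.search(r"[Ssæ]", "Ævar", re.IGNORECASE | re.ASCII)

print(unicode_digit is not None, ascii_digit is not None)
print(unicode_fold is not None, ascii_fold is not None)
```

Note that this only covers the pattern-level half of the problem; the subject here is already a decoded string, so the question of how the input was decoded does not arise.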
From: demerphq <demerphq@HIDDEN>
Date: Thu, 6 Apr 2023 15:39:31 +0200
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Junio C Hamano <gitster@HIDDEN>
Cc: Paul Eggert <eggert@HIDDEN>, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Message-ID: <CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@HIDDEN>
In-Reply-To: <xmqqttxvzbo8.fsf@HIDDEN>

On Tue, 4 Apr 2023 at 21:31, Junio C Hamano <gitster@HIDDEN> wrote:
>
> Paul Eggert <eggert@HIDDEN> writes:
>
> > This is an evolving area. Git master is fiddling with flags and
> > options, and so is GNU grep master, and so is PCRE2, and there are
> > bugs. If you're running bleeding-edge versions of this code you'll get
> > different behavior than if you're running grep 3.8, pcregrep 8.45,
> > Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
> >
> > What I'm fearing is that we may evolve into mutually incompatible
> > interpretations of how Perl regular expressions deal with UTF-8
> > text. That'd be a recipe for confusion down the road.
>
> Nicely said. My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9]), but even in Git
> circle there would be different opinions, so I am glad that the
> discussion is visible on the list to those who are interested.

Perl matches Unicode text according to the rules specified by the
Unicode Consortium. It is the reference implementation for Unicode
regular expression matching. Unicode specifies that \d match any digit
in any script that it supports. Thus \d matches far more codepoints
than \p{PosixDigit} or [0-9] would. Be aware that Unicode contains and
separates numbers and digits, e.g., \x{1EC9E} represents a Lakh, which
is used in many Indian languages for 100,000, but which is not
considered a *digit* for obvious reasons.

FWIW, someone mentioned [[:digit:]], which matches the same as \d does
on Unicode strings and under the /u matching flag for regexes in Perl.
Arguably this was a mistake: [[:digit:]] is a POSIX character class,
and POSIX doesn't support Unicode, so it should have matched [0-9] or
\p{PosixDigit}. But historically \d and [[:digit:]] in Perl were the
same, and when \d was extended to meet the Unicode specification
[[:digit:]] came along for the ride, likely inadvertently. Thus
\p{PosixDigit} is equivalent to [0-9], but \p{XPosixDigit} is
equivalent to \d and [[:digit:]].

I notice that other posts in this thread have moved the conversation
on, and covered most of the points I wanted to make here. However I
wanted to say that there seem to be two different issues here. The
first is "what semantics do I expect from my regular expressions",
Unicode or legacy ASCII. Mostly this relates to case-insensitive
matching, but things like \d also surface discrepancies. The second is
"what encodings does the regular expression engine understand".
Unfortunately on *nix there is no tradition of using BOMs to
distinguish the 6 different possible encodings of Unicode (UTF-8,
UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE), and there seems
to be some level of desire to match with Unicode semantics against
files that are not uniformly encoded in one of these formats.

So the question comes up: A) how do you tell the regular expression
engine what semantics you want, and B) how does the regular expression
library identify the encoding in the file, and how does it handle
malformed content in that file? For instance, if I have a file which
contains snippets of UTF8-encoded data, *and* snippets of data that is
illegal in UTF8, what should the regular expression engine do if it is
asked to do a case-insensitive match against that file?

cheers,
yves
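The closing question, what to do with input that mixes valid and invalid UTF-8, has no single right answer, but the candidate policies are easy to state concretely. The sketch below is an illustration in Python under assumed, made-up input; it is not how grep, git grep, PCRE2 or Perl handle the case. It shows three policies: reject the input, substitute U+FFFD for bad octets, or carry bad octets through the decode so the original bytes can be recovered.

```python
import re

# Hypothetical input: valid UTF-8, one stray 0xFF octet, more valid UTF-8.
data = "Ævar ".encode("utf-8") + b"\xff" + " ævar".encode("utf-8")

pattern = re.compile(r"ævar", re.IGNORECASE)

# Policy 1: refuse malformed input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: replace each undecodable octet with U+FFFD and keep matching.
replaced = data.decode("utf-8", errors="replace")

# Policy 3: map undecodable octets to lone surrogates, which lets the
# original bytes be reconstructed exactly after matching.
escaped = data.decode("utf-8", errors="surrogateescape")

hits_replaced = pattern.findall(replaced)
hits_escaped = pattern.findall(escaped)
round_trip = escaped.encode("utf-8", errors="surrogateescape") == data

print(strict_ok, hits_replaced, hits_escaped, round_trip)
```

Policies 2 and 3 both find the two case-insensitive matches around the bad octet; they differ in whether the damaged bytes survive for output.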
From: Carlo Arenas <carenas@HIDDEN>
Date: Wed, 5 Apr 2023 14:20:51 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Jim Meyering <jim@HIDDEN>
Cc: demerphq@HIDDEN, Paul Eggert <eggert@HIDDEN>, pcre2-dev@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Philip.Hazel@HIDDEN, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Message-ID: <CAPUEspjM6PtsY9LiK9Lqb2+H2UrWEfPziWVrOPwZGVpVbx7aJQ@HIDDEN>
In-Reply-To: <CA+8g5KHYqgAZPpTOXWekDpWv-mvj-rBkGu+4MXy4OB1VDeS4Lw@HIDDEN>

On Wed, Apr 5, 2023 at 12:40 PM Jim Meyering <jim@HIDDEN> wrote:
>
> Changing grep -P's \d to match multibyte digits by default would break
> an important contract.

While I tend to agree[1] (and indeed that is why PCRE2_EXTRA_ASCII_BSD
was invented), it would also be important to note that it goes against
the Unicode recommendation[2] and it is actually not true already[3]
for Python, .NET or Rust (which means ripgrep behaves like GNU grep -P
3.9).

FWIW I also agree that (at least `git grep -P`) should use
PCRE2_EXTRA_ASCII_BSD by default, as that is what makes more sense in
the context of matching source code, and using `\p{Nd}` instead if you
really want all Unicode digits is not much of a burden. But I am also
not sure if that makes sense in other contexts, especially considering
that I am obviously biased, since the languages I mostly interact with
ONLY use Arabic numerals and therefore `\d` meaning `[0-9]` seems
"normal".

Carlo

CC: changed to the real email address for PCRE2 development; for full
context on this thread use [4]

[1] https://github.com/PCRE2Project/pcre2/pull/186
[2] https://unicode.org/reports/tr18/
[3] https://regex101.com/r/S5RW4c/1
[4] https://lore.kernel.org/git/230109.86v8lf297g.gmgdl@HIDDEN/T/
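The point that `\d` already means "any Unicode decimal digit" in Python can be checked directly. This is a small sketch, assuming only the standard `re` module; the exact number of matching code points depends on the Unicode version bundled with the interpreter, so no figure is quoted for it here.

```python
import re
import sys

# Every code point as one long string (lone surrogates are allowed in str).
all_chars = "".join(map(chr, range(sys.maxunicode + 1)))

unicode_digits = re.findall(r"\d", all_chars)          # default for str patterns
ascii_digits = re.findall(r"\d", all_chars, re.ASCII)  # ASCII-only \d
explicit_range = re.findall(r"[0-9]", all_chars)       # what many users assume

# The first count varies by Python/Unicode version; the other two are 10.
print(len(unicode_digits), len(ascii_digits), len(explicit_range))
```

Writing `[0-9]` explicitly, or passing `re.ASCII`, is the Python-side equivalent of opting out of Unicode digit matching.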
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Date: Wed, 5 Apr 2023 13:03:51 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Jim Meyering <jim@HIDDEN>
Cc: demerphq@HIDDEN, Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, git@HIDDEN, Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, pcre-dev@HIDDEN
Message-ID: <ed237a07-2f77-74eb-2f52-49b9b8f08873@HIDDEN>
In-Reply-To: <CA+8g5KHYqgAZPpTOXWekDpWv-mvj-rBkGu+4MXy4OB1VDeS4Lw@HIDDEN>

On 2023-04-05 12:40, Jim Meyering wrote:
> (C) preserve grep -P's tradition of \d matching only 0..9, and once
> grep uses 10.43 or newer, \b and \w will also work as desired.

If I understand you correctly, (C) would mean that GNU grep -P, git
grep -P, and pcre2grep -u would all use PCRE2_UTF | PCRE2_UCP, and
would also use the extra option PCRE2_EXTRA_ASCII_BSD that is planned
for 10.43 PCRE2.

This would require changes to bleeding-edge pcre2grep -u (since it
would need to add PCRE2_EXTRA_ASCII_BSD unless --no-ucp is also
given), and to git grep -P (which would need to add PCRE2_UCP and
PCRE2_EXTRA_ASCII_BSD, when libpcre2 is new enough to #define
PCRE2_EXTRA_ASCII_BSD).

This option works for me as well. In fact it's the least work for me
since I already implemented it in bleeding-edge GNU grep (so it works
this way already :-).
From: Jim Meyering <jim@HIDDEN>
Date: Wed, 5 Apr 2023 12:40:18 -0700
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Paul Eggert <eggert@HIDDEN>
Cc: demerphq@HIDDEN, Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, git@HIDDEN, Junio C Hamano <gitster@HIDDEN>, Tukusej’s Sirs <tukusejssirs@HIDDEN>, pcre-dev@HIDDEN
Message-ID: <CA+8g5KHYqgAZPpTOXWekDpWv-mvj-rBkGu+4MXy4OB1VDeS4Lw@HIDDEN>
In-Reply-To: <6d86214a-1b80-eb88-1efb-36e61fd3203e@HIDDEN>

On Wed, Apr 5, 2023 at 11:33 AM Paul Eggert <eggert@HIDDEN> wrote:
> On 2023-04-04 12:31, Junio C Hamano wrote:
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>   $ echo 'Ævar' | pcre2grep '[Ssß]'
>   Ævar
>
> it mimics Perl 5.36:
>
>   $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>   Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.
>
> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

Good argument for making PCRE2_UCP the default.

> Neither git grep -P nor the current release of pcre2grep -u have \d
> matching non-ASCII digits, because they do not use PCRE2_UCP. However,
> in a February 8 commit[1], Philip Hazel changed pcre2grep to use
> PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU
> grep -P did (though 3.10 has changed this).
>
> That February commit also added a --no-ucp option, to disable PCRE2_UCP.
> So as I understand it, if you're in a UTF-8 locale:
>
> * 10.43 pcre2grep -u will behave like 3.9 GNU grep -P.
>
> * 10.43 pcre2grep -u --no-ucp will behave like git grep -P.
>
> * Current GNU grep -P is different from everybody else.
>
> This incompatibility is not good.
>
> Here are two ways forward to fix this incompatibility (there are other
> possibilities of course):
>
> (A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
> --no-ucp, and git grep -P follows suit. That is, both GNU and git grep
> act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
> enable PCRE2_UCP unless --no-ucp is given. This would cause \d to match
> non-ASCII digits unless --no-ucp is given.
>
> (B) GNU grep -P and git grep -P mimic pcre2grep in both -u and --no-ucp.
> That is, they would both do 8-bit-only by default, and use PCRE2_UTF
> only when -u or --utf is given, and use PCRE2_UCP only when --no-ucp is
> absent. This would cause \d to match non-ASCII digits only when -u is
> given but --no-ucp is not.

Changing grep -P's \d to match multibyte digits by default would break
an important contract. Avoiding that feels like it must outweigh any
cross-tool portability concern.

(C) preserve grep -P's tradition of \d matching only 0..9, and once
grep uses 10.43 or newer, \b and \w will also work as desired.

> Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git grep
> -P would be consistent.

I hope git grep -P's \d will also stick to ASCII-only by default.
Those rare few who desire multibyte matches can always specify \p{Nd}
instead of \d, or (with new enough PCRE2), use (?-aD) and (?aD) to
toggle the digit-matching mode.
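Toggling the digit-matching mode inside a single pattern can be illustrated with an engine that has a comparable feature. The sketch below uses Python's scoped inline flag `(?a:...)`, which assumes Python 3.7 or newer; it is only an analogy for PCRE2's `(?aD)` and does not use PCRE2 itself. The pattern names `mixed` and `plain` are hypothetical.

```python
import re

hmong_one = "\U00016B51"  # a non-ASCII decimal digit

# \w keeps Unicode semantics; only the \d inside (?a:...) is ASCII-only.
mixed = re.compile(r"\w+(?a:\d)")
# Same pattern with Unicode semantics throughout.
plain = re.compile(r"\w+\d")

mixed_ascii = mixed.fullmatch("Ævar7")              # Unicode word, ASCII digit
mixed_hmong = mixed.fullmatch("Ævar" + hmong_one)   # trailing digit is not ASCII
plain_hmong = plain.fullmatch("Ævar" + hmong_one)   # accepted under Unicode \d

print(mixed_ascii is not None, mixed_hmong is not None, plain_hmong is not None)
```

This mirrors the shape of option (C): word characters follow Unicode rules while `\d` stays restricted to 0..9 unless the pattern author asks otherwise.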
Received: (at 60690) by debbugs.gnu.org; 5 Apr 2023 19:37:58 +0000
From: Junio C Hamano <gitster@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Cc: demerphq@HIDDEN, Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Wed, 05 Apr 2023 12:37:49 -0700
Message-ID: <xmqqlej6unle.fsf@HIDDEN>
In-Reply-To: <6d86214a-1b80-eb88-1efb-36e61fd3203e@HIDDEN> (Paul Eggert's message of "Wed, 5 Apr 2023 11:32:38 -0700")

Paul Eggert <eggert@HIDDEN> writes:

> Here are two ways forward to fix this incompatibility (there are other
> possibilities of course):
>
> (A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
> --no-ucp, and git grep -P follows suit. That is, both GNU and git grep
> act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
> enable PCRE2_UCP unless --no-ucp is given. This would cause \d to
> match non-ASCII digits unless --no-ucp is given.
>
> (B) GNU grep -P and git grep -P mimic pcre2grep in both -u and
> --no-ucp. That is, they would both do 8-bit-only by default, and use
> PCRE2_UTF only when -u or --utf is given, and use PCRE2_UCP only when
> --no-ucp is absent. This would cause \d to match non-ASCII digits only
> when -u is given but --no-ucp is not.
>
> Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git
> grep -P would be consistent.
>
> I mildly prefer (B) but (A) would also work. (One advantage of (B) is
> that it should be faster....)

For "git grep -P", I would like to hear from Carlo and Ævar; I agree
both (A) and (B) would be workable solutions, and have a slight
preference for a solution that does not add more options that take
effect only when -P is given, simply because these options are
cumbersome to document and explain, but that is a very minor point.

Thanks.
Received: (at 60690) by debbugs.gnu.org; 5 Apr 2023 19:04:38 +0000
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
To: Junio C Hamano <gitster@HIDDEN>
Cc: demerphq@HIDDEN, Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, git@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, pcre-dev@HIDDEN
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Wed, 5 Apr 2023 12:04:28 -0700
Message-ID: <33b3eb15-73e2-8004-9f06-19e5ec5c5877@HIDDEN>
In-Reply-To: <6d86214a-1b80-eb88-1efb-36e61fd3203e@HIDDEN>

On 2023-04-05 11:32, Paul Eggert wrote:
> in a February 8 commit[1], Philip Hazel changed pcre2grep to use
> PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU
> grep -P did (though 3.10 has changed this).

Sorry, due to fumblefingers I gave the wrong URL for [1]. Here's a
corrected URL:

https://github.com/PCRE2Project/pcre2/commit/8385df8c97b6f8069a48e600c7e4e94cc3e3ebd9

It also mentions a new --case-restrict option, intended for 10.43
pcre2grep. Given Perl's and PCRE2's plethora of options I suppose one
could imagine several other options of that ilk.
Received: (at 60690) by debbugs.gnu.org; 5 Apr 2023 18:32:48 +0000
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
To: Junio C Hamano <gitster@HIDDEN>
Cc: demerphq@HIDDEN, Philip.Hazel@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Wed, 5 Apr 2023 11:32:38 -0700
Message-ID: <6d86214a-1b80-eb88-1efb-36e61fd3203e@HIDDEN>
In-Reply-To: <xmqqttxvzbo8.fsf@HIDDEN>

On 2023-04-04 12:31, Junio C Hamano wrote:
> My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9])

I looked into what pcre2grep does. It has always done only 8-bit
processing unless you use the -u or --utf option, so plain "pcre2grep
'\d'" matches only ASCII digits.

Although this causes pcre2grep to mishandle Unicode characters:

  $ echo 'Ævar' | pcre2grep '[Ssß]'
  Ævar

it mimics Perl 5.36:

  $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
  Ævar

so this seems to be what Perl users expect, despite its infelicities.

For better Unicode handling one can use pcre2grep's -u or --utf option,
which causes pcre2grep to behave more like GNU grep -P and git grep -P:
"echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
what most people would expect (unless they're Perl users :-).

Neither git grep -P nor the current release of pcre2grep -u have \d
matching non-ASCII digits, because they do not use PCRE2_UCP. However,
in a February 8 commit[1], Philip Hazel changed pcre2grep to use
PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU
grep -P did (though 3.10 has changed this).

That February commit also added a --no-ucp option, to disable PCRE2_UCP.
So as I understand it, if you're in a UTF-8 locale:

* 10.43 pcre2grep -u will behave like 3.9 GNU grep -P.

* 10.43 pcre2grep -u --no-ucp will behave like git grep -P.

* Current GNU grep -P is different from everybody else.

This incompatibility is not good.

Here are two ways forward to fix this incompatibility (there are other
possibilities of course):

(A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
--no-ucp, and git grep -P follows suit. That is, both GNU and git grep
act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
enable PCRE2_UCP unless --no-ucp is given. This would cause \d to match
non-ASCII digits unless --no-ucp is given.

(B) GNU grep -P and git grep -P mimic pcre2grep in both -u and --no-ucp.
That is, they would both do 8-bit-only by default, and use PCRE2_UTF
only when -u or --utf is given, and use PCRE2_UCP only when --no-ucp is
absent. This would cause \d to match non-ASCII digits only when -u is
given but --no-ucp is not.

Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git grep
-P would be consistent.

I mildly prefer (B) but (A) would also work. (One advantage of (B) is
that it should be faster....)

[1]: https://github.com/PCRE2Project/pcre2/commit/8385df8c97b6f8069a48e600c7e4e94cc3e3ebd9ht
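Editorial illustration, not part of the original message: the difference between proposals (A) and (B) comes down to when \d would match non-ASCII digits in a UTF-8 locale. The sketch below models only that outcome as described in the message above. The function name d_matches_non_ascii and its three arguments are hypothetical, and nothing here is code from grep, git, or PCRE2.

```shell
# Hypothetical model of proposals (A) and (B); not code from grep, git, or PCRE2.
# Usage: d_matches_non_ascii PROPOSAL UTF NO_UCP
#   PROPOSAL: A or B
#   UTF:      1 if -u/--utf was given, else 0 (ignored under A)
#   NO_UCP:   1 if --no-ucp was given, else 0
# Prints "yes" if \d would match non-ASCII digits in a UTF-8 locale, else "no".
d_matches_non_ascii() {
  proposal=$1 utf=$2 no_ucp=$3
  case $proposal in
    A) # PCRE2_UTF is always on; PCRE2_UCP is on unless --no-ucp is given.
       if [ "$no_ucp" -eq 0 ]; then echo yes; else echo no; fi ;;
    B) # 8-bit by default; Unicode digits need -u and no --no-ucp.
       if [ "$utf" -eq 1 ] && [ "$no_ucp" -eq 0 ]; then echo yes; else echo no; fi ;;
    *) echo "unknown proposal: $proposal" >&2; return 2 ;;
  esac
}

d_matches_non_ascii A 0 0   # plain grep -P under (A): prints "yes"
d_matches_non_ascii B 0 0   # plain grep -P under (B): prints "no"
```

Under this reading, the two proposals differ only in what a plain -P invocation does by default; once -u and --no-ucp are both accounted for, they agree.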
Received: (at 60690) by debbugs.gnu.org; 4 Apr 2023 19:32:01 +0000
From: Junio C Hamano <gitster@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Cc: demerphq@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Carlo Arenas <carenas@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Tue, 04 Apr 2023 12:31:51 -0700
Message-ID: <xmqqttxvzbo8.fsf@HIDDEN>
In-Reply-To: <96358c4e-7200-e5a5-869e-5da9d0de3503@HIDDEN> (Paul Eggert's message of "Tue, 4 Apr 2023 11:25:59 -0700")

Paul Eggert <eggert@HIDDEN> writes:

> This is an evolving area. Git master is fiddling with flags and
> options, and so is GNU grep master, and so is PCRE2, and there are
> bugs. If you're running bleeding-edge versions of this code you'll get
> different behavior than if you're running grep 3.8, pcregrep 8.45,
> Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
>
> What I'm fearing is that we may evolve into mutually incompatible
> interpretations of how Perl regular expressions deal with UTF-8
> text. That'd be a recipe for confusion down the road.

Nicely said. My personal inclination is to let Perl folks decide
and follow them (even though I am skeptical about the wisdom of
letting '\d' match anything other than [0-9]), but even in Git
circles there would be different opinions, so I am glad that the
discussion is visible on the list to those who are interested.
Received: (at 60690) by debbugs.gnu.org; 4 Apr 2023 18:26:12 +0000
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
To: Carlo Arenas <carenas@HIDDEN>
Cc: demerphq@HIDDEN, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01@HIDDEN>, Ævar Arnfjörð Bjarmason <avarab@HIDDEN>, gitster@HIDDEN, pcre-dev@HIDDEN, Tukusej’s Sirs <tukusejssirs@HIDDEN>, git@HIDDEN
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Tue, 4 Apr 2023 11:25:59 -0700
Message-ID: <96358c4e-7200-e5a5-869e-5da9d0de3503@HIDDEN>
In-Reply-To: <CAPUEspj1m6F0_XgOFUVaq3Aq_Ah3PzCUs7YUyFH9_Zz-MOYTTA@HIDDEN>

On 4/3/23 23:56, Carlo Arenas wrote:
> On Mon, Apr 3, 2023 at 2:38 PM Paul Eggert <eggert@HIDDEN> wrote:
>>
>> on March 23 Git disabled
>> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
>> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
>> appear in the next PCRE2 release.
>
> Presume PCRE2 is a typo and should have been "git" here?

No, I was talking about what options Git uses when it calls PCRE2
functions. In other words, this is about whether GNU 'grep -P' should be
compatible with 'git grep -P' (as well as with Perl and with pcregrep),
when interpreting \d and similar constructs.

This is an evolving area. Git master is fiddling with flags and options,
and so is GNU grep master, and so is PCRE2, and there are bugs. If
you're running bleeding-edge versions of this code you'll get different
behavior than if you're running grep 3.8, pcregrep 8.45, Perl 5.36, and
git 2.39.2 (which is what Fedora 37 has).

What I'm fearing is that we may evolve into mutually incompatible
interpretations of how Perl regular expressions deal with UTF-8 text.
That'd be a recipe for confusion down the road.
Full text available.Received: (at 60690) by debbugs.gnu.org; 4 Apr 2023 15:32:06 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Tue Apr 04 11:32:06 2023 Received: from localhost ([127.0.0.1]:50022 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1pjidh-000437-Qs for submit <at> debbugs.gnu.org; Tue, 04 Apr 2023 11:32:06 -0400 Received: from mail-lf1-f41.google.com ([209.85.167.41]:42513) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <meyering@HIDDEN>) id 1pjidf-00042Z-GF for 60690 <at> debbugs.gnu.org; Tue, 04 Apr 2023 11:32:04 -0400 Received: by mail-lf1-f41.google.com with SMTP id g19so29735047lfr.9 for <60690 <at> debbugs.gnu.org>; Tue, 04 Apr 2023 08:32:03 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680622317; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=wnAS7PnZ+qI47ad2cTwmspqJ/bYge73O/i5SR9X3vUw=; b=CieOdYctcadTYiE5v1YzIhjjQNykINWkd3lfEs6pYleJblKoEaaN4hrFx8Fjjqsbla Dky7EXhHv3YEb4pqSoXhxXhbS5h5Uk52t2sxNzCH4FR0fM+Wy9KWZeGmkSodgj5vb1Mk GhgeWDKOrYnNRO3IKvPNK8UETnq72PCohlw9ZirI+kLPVKQQLHo1EWwPW6HPFp5Z/IHA Y4guuD3n6JJ6ghcf879cOBfYRTPCg3SoOFdUBt70r+xfjfvGpIxah2aLymfFCZbuVFi8 2GcVyFd3LISYy5hyV/XSjgdBwsZnsasmZrJQ72pcisEaKoNU+4qoFGstsohNq8Kyqy7A keoA== X-Gm-Message-State: AAQBX9dYacTbIb7iheG1YvctDhs3TtIIqS8HcVrNfBWppmMZYkiWMGzD ssKS8M6wyr2lbP+nkzRGXl5JvambAopp5+X/k+o= X-Google-Smtp-Source: AKy350aQK+9RCaSofpCjRENW8HmYTkll3IqdCd7d/eG1RmpuOsmcEPOT7NDhkJJdYa25UePBK2pfpReUFNKhcodRJdE= X-Received: by 2002:a19:ad09:0:b0:4d5:ca32:6aea with SMTP id t9-20020a19ad09000000b004d5ca326aeamr879865lfc.10.1680622317044; Tue, 04 Apr 2023 08:31:57 -0700 (PDT) MIME-Version: 1.0 References: <230109.86v8lf297g.gmgdl@HIDDEN> <2554712d-e386-3bab-bc6c-1f0e85d999db@HIDDEN> 
<CA+8g5KHuE-kQqmi9cVjeJbpyt54v9m9omh9A9we1zmR0+aTDHg@HIDDEN> <920dcc8d-9e45-a03e-af06-6b420c6e0f81@HIDDEN>
In-Reply-To: <920dcc8d-9e45-a03e-af06-6b420c6e0f81@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Tue, 4 Apr 2023 08:31:44 -0700
Message-ID: <CA+8g5KGGnsf0xMCXO28R1m8-z76=kG_AiYRh6=OgRL+x5C1yqQ@HIDDEN>
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Paul Eggert <eggert@HIDDEN>

On Mon, Apr 3, 2023 at 11:47 PM Paul Eggert <eggert@HIDDEN> wrote:
> On 2023-04-03 20:30, Jim Meyering wrote:
> > have you seen justification
> > (other than for compatibility with some other tool or language) for
> > allowing \d to match non-ASCII by default, in spite of the risks?
>
> In the example Ævar supplied in <https://bugs.gnu.org/60690>, my
> impression was that it was better when \d matched non-ASCII digits. That
> is, in a UTF-8 locale it's better when \d finds matches in these lines:
>
> >> > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> >> > git-gui/po/ja.po:"- 第２行: 空白\n"
>
> because they contain the Japanese digits "１" and "２". This was the only
> example I recall being given.

Before it was unintentionally enabled in grep-3.9, lines like that
had never been matched by grep -P's '\d'. By relaxing \d, we'd weaken
any application that uses, say, grep -P '^\d+$' to perform input
validation intending to ensure that some input is all ASCII digits.
It's not a big stretch to imagine that some downstream processor of
that "verified" data is not prepared to deal with multi-byte digits.

> Also, I find it odd that grep -P '^[\w\d]*$' matches lines containing
> any sort of Arabic word characters, but it rejects lines containing
> Arabic digits like "٣" that are perfectly reasonable in Arabic-language
> text. I also find it odd that [\d] and [[:digit:]] mean different things.
>
> There are arguments on the other side, otherwise we wouldn't be having
> this discussion. And it's true that grep -P '\d' formerly rejected
> Arabic digits (though it's also true that grep -P '\w' formerly rejected
> Arabic letters...). Still, the cure's oddness and incompatibility with
> Git, Perl, etc. appears to me to be worse than the disease of dealing
> with grep -P invocations that need to use [0-9] or LC_ALL="C" anyway if
> they want to be portable to any program other than GNU grep.

I'm primarily concerned about not introducing a persistent regression
in how GNU grep's -P '\d' works in multibyte locales. The corner cases
you mention do matter, of course, but are far less likely to matter in
practice.
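The input-validation concern above can be illustrated with a small sketch. It uses Python's `re` module as a stand-in, not grep or PCRE2: for `str` patterns Python's `\d` matches any Unicode decimal digit (roughly analogous to PCRE2 with PCRE2_UCP), while the `re.ASCII` flag restricts it to `[0-9]`. The helper name `validate` and the sample input are assumptions made for illustration only.

```python
import re

# Full-width digits U+FF11..U+FF13, the same kind of character as in the
# git-gui/po/ja.po lines quoted in the thread.
FULLWIDTH = "\uff11\uff12\uff13"

# For str patterns, Python's \d means "any Unicode decimal digit".
# This is only an analogue of PCRE2_UCP, not the same engine.
unicode_digits = re.compile(r"^\d+$")

# re.ASCII narrows \d to [0-9], analogous to an ASCII-only \d.
ascii_digits = re.compile(r"^\d+$", re.ASCII)

# An explicit range means ASCII digits regardless of flags.
explicit_range = re.compile(r"^[0-9]+$")


def validate(pattern, text):
    """Return True if the whole of text is accepted by pattern."""
    return pattern.match(text) is not None


if __name__ == "__main__":
    for label, pattern in [
        ("unicode \\d", unicode_digits),
        ("ascii \\d", ascii_digits),
        ("[0-9]", explicit_range),
    ]:
        print(label, validate(pattern, "123"), validate(pattern, FULLWIDTH))
```

Under these assumptions the Unicode-aware pattern accepts the full-width input while the other two reject it, which is the gap a validator written as '^\d+$' would silently acquire. Writing `[0-9]` explicitly sidesteps the question in either mode.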
From: Carlo Arenas <carenas@HIDDEN>
Date: Mon, 3 Apr 2023 23:56:54 -0700
Message-ID: <CAPUEspj1m6F0_XgOFUVaq3Aq_Ah3PzCUs7YUyFH9_Zz-MOYTTA@HIDDEN>
In-Reply-To: <2554712d-e386-3bab-bc6c-1f0e85d999db@HIDDEN>
Subject: Re: -P '\d' in GNU and git grep
To: Paul Eggert <eggert@HIDDEN>

On Mon, Apr 3, 2023 at 2:38 PM Paul Eggert <eggert@HIDDEN> wrote:
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.

Presume PCRE2 is a typo and should have been "git" here?

FWIW the PCRE2 fix[1] has been released already with 10.35 and
backporting to the Ubuntu 20.04 package that crashed in the original
report would also solve the crash with 10.34.

Carlo

[1] https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
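The version cutoff discussed above (10.34 and earlier affected, fix released with 10.35) could be expressed as a simple comparison. The following is only a sketch: the function names are hypothetical, it is not how Git implements its check, and the accepted input format is an assumption (a leading "major.minor", optionally followed by other text).

```python
# Cutoff taken from the thread: PCRE2 10.34 and earlier are treated as affected.
LAST_AFFECTED = (10, 34)


def parse_pcre2_version(text):
    """Turn a version string such as '10.42 2022-12-11' into (10, 42)."""
    number = text.split()[0]
    major, minor = number.split(".")[:2]
    return int(major), int(minor)


def ucp_looks_safe(version_text):
    """Hypothetical gate: True when the version is newer than the cutoff."""
    return parse_pcre2_version(version_text) > LAST_AFFECTED


if __name__ == "__main__":
    for version in ["10.34", "10.35", "10.42 2022-12-11"]:
        print(version, ucp_looks_safe(version))
```

A check on the version number alone cannot see a distribution package that carries a backported fix, such as the 10.34 backport mentioned above, so it would err on the side of disabling PCRE2_UCP there.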
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 3 Apr 2023 23:46:57 -0700
Message-ID: <920dcc8d-9e45-a03e-af06-6b420c6e0f81@HIDDEN>
In-Reply-To: <CA+8g5KHuE-kQqmi9cVjeJbpyt54v9m9omh9A9we1zmR0+aTDHg@HIDDEN>
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Jim Meyering <jim@HIDDEN>

On 2023-04-03 20:30, Jim Meyering wrote:
> have you seen justification
> (other than for compatibility with some other tool or language) for
> allowing \d to match non-ASCII by default, in spite of the risks?

In the example Ævar supplied in <https://bugs.gnu.org/60690>, my
impression was that it was better when \d matched non-ASCII digits. That
is, in a UTF-8 locale it's better when \d finds matches in these lines:

>> > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
>> > git-gui/po/ja.po:"- 第２行: 空白\n"

because they contain the Japanese digits "１" and "２". This was the only
example I recall being given.

Also, I find it odd that grep -P '^[\w\d]*$' matches lines containing
any sort of Arabic word characters, but it rejects lines containing
Arabic digits like "٣" that are perfectly reasonable in Arabic-language
text. I also find it odd that [\d] and [[:digit:]] mean different things.

There are arguments on the other side, otherwise we wouldn't be having
this discussion. And it's true that grep -P '\d' formerly rejected
Arabic digits (though it's also true that grep -P '\w' formerly rejected
Arabic letters...). Still, the cure's oddness and incompatibility with
Git, Perl, etc. appears to me to be worse than the disease of dealing
with grep -P invocations that need to use [0-9] or LC_ALL="C" anyway if
they want to be portable to any program other than GNU grep.
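For readers who want to check what the characters named above are, Python's standard `unicodedata` module reports their Unicode properties. This is only an inspection aid; it says nothing about how any particular grep build treats them. The `describe` helper is an illustrative name.

```python
import unicodedata

# Characters mentioned in the thread: an ASCII digit, the two full-width
# digits from git-gui/po/ja.po, and the Arabic-Indic digit three.
SAMPLES = ["3", "\uff11", "\uff12", "\u0663"]


def describe(ch):
    """Return (name, general category, decimal value) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.decimal(ch))


if __name__ == "__main__":
    for ch in SAMPLES:
        name, category, value = describe(ch)
        print(f"U+{ord(ch):04X} {name}: category={category}, value={value}")
```

All four characters fall in general category Nd (decimal digit), which is the property that \p{Nd} names. Whether \d should be shorthand for that whole category or for [0-9] alone is the point under discussion.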
From: Jim Meyering <jim@HIDDEN>
Date: Mon, 3 Apr 2023 20:30:02 -0700
Message-ID: <CA+8g5KHuE-kQqmi9cVjeJbpyt54v9m9omh9A9we1zmR0+aTDHg@HIDDEN>
In-Reply-To: <2554712d-e386-3bab-bc6c-1f0e85d999db@HIDDEN>
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Paul Eggert <eggert@HIDDEN>

On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert <eggert@HIDDEN> wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that. It's clearly a good idea to disable PCRE2_UCP
for those using those older, known-buggy versions of pcre2.
The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
> [6]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]:
> https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@HIDDEN/
> [8]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links.
However, have you seen justification (other than for compatibility
with some other tool or language) for allowing \d to match non-ASCII
by default, in spite of the risks?

IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 3 Apr 2023 14:38:42 -0700
Message-ID: <2554712d-e386-3bab-bc6c-1f0e85d999db@HIDDEN>
Subject: -P '\d' in GNU and git grep
To: 60690 <at> debbugs.gnu.org

I've recently done some bug-report maintenance about a set of GNU grep
bug reports related to whether "grep -P '\d'" should match
non-ASCII digits, and have some thoughts about coordinating GNU grep
with git grep in this department.

GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
Savannah's copy of GNU grep, and some sort of fix should appear in the
next grep release. However, I'm leaving the GNU grep bug report open for
now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
found in latest stable release v3.10 of grep". I merged these related
bug reports, and the oldest one, Bug#60690, is now the representative
displayed in the GNU grep bug list[4].

For this set of grep bug reports there's still a pending issue discussed
in my recent email[5], which proposes a patch so I've tagged Bug#60690
with "patch". The proposal is that GNU grep -P '\d' should revert to the
grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
non-ASCII decimal digits.

In researching this a bit further, I found that on March 23 Git disabled
the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
appear in the next PCRE2 release.

When PCRE2 10.35 comes out, it appears that 'git grep -P' will behave
like 'grep -P' only if GNU grep adopts something like the solution
proposed in [5].

[1]: https://bugs.gnu.org/62605
[2]: https://bugs.gnu.org/60690
[3]: https://bugs.gnu.org/62552
[4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
[5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
[6]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
[7]: https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@HIDDEN/
[8]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 9 Jan 2023 15:12:23 -0800
Message-ID: <80b42740-c85b-cef2-622c-c5b2450e264c@HIDDEN>
In-Reply-To: <230109.865ydf1mdu.gmgdl@HIDDEN>
Subject: Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
To: Ævar Arnfjörð Bjarmason <avarab@HIDDEN>

On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

> /b:
> 155781
> (*UCP)/b:
> 46035
> /s:
> 0
> (*UCP)/s:
> 0
> /w:
> 142468
> (*UCP)/w:
> 9706
>
> So the output still differs, and some of those differences may or may
> not be wanted.

I took a look at the output, and by and large I'd want the differences;
that is, I'd want the UCP version, which generates less output. This is
because several Emacs source files are not UTF-8, and \b has nonsense
matches when searching text files encoded via Shift-JIS or Big 5 or
whatever. For this sort of thing, the fewer matches the better.

> If all you're doing is matching either ASCII or Japanese text and you
> want "locale-aware numbers" it might do the wrong thing.

I'm not seeing much of a problem here. When searching Japanese text, I
would expect \d and [0-9０-９] (using both ASCII and full-width digits) to
be equivalent so (assuming UCP) it's not a big deal as to which regex
you use, since Japanese text won't contain Bengali (or whatever) digits.
And when searching binary data, I'd expect a bunch of garbage no matter
how \d is interpreted.

Here I'm assuming [０-９] (using full-width digits) has the expected
meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that
POSIX made.
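The expectation stated above, that \d and [0-9０-９] select the same lines of Japanese text, can be tried out in a small sketch. It again uses Python's `re` as an analogue of a Unicode-aware \d rather than PCRE2 itself, and the sample lines other than the two ja.po strings are made up for illustration.

```python
import re

# The two ja.po strings come from the thread; the last two lines are
# invented sample data.
LINES = [
    '"- 第１行: 何をしたか、を１行で要約。\\n"',
    '"- 第２行: 空白\\n"',
    "grep 3.10",
    "no digits here",
]

# Unicode-aware \d (Python's default for str patterns).
unicode_d = re.compile(r"\d")

# Explicit ASCII plus full-width ranges, U+0030..U+0039 and U+FF10..U+FF19.
explicit = re.compile("[0-9\uff10-\uff19]")

# BENGALI DIGIT ONE, a decimal digit outside both explicit ranges.
BENGALI_ONE = "\u09e7"


def matching(pattern, lines):
    """Return the lines in which pattern finds at least one match."""
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    print(matching(unicode_d, LINES) == matching(explicit, LINES))
    print(bool(unicode_d.search(BENGALI_ONE)), bool(explicit.search(BENGALI_ONE)))
```

On this sample the two patterns agree; they diverge only on digits from other scripts, such as the Bengali one, which is the case the message argues is unlikely in Japanese text.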
Full text available.Received: (at 60690) by debbugs.gnu.org; 9 Jan 2023 20:30:47 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jan 09 15:30:47 2023 Received: from localhost ([127.0.0.1]:38294 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1pEyn9-0000iP-3m for submit <at> debbugs.gnu.org; Mon, 09 Jan 2023 15:30:47 -0500 Received: from mail-wm1-f43.google.com ([209.85.128.43]:40912) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <avarab@HIDDEN>) id 1pEyn6-0000i9-SK for 60690 <at> debbugs.gnu.org; Mon, 09 Jan 2023 15:30:45 -0500 Received: by mail-wm1-f43.google.com with SMTP id k26-20020a05600c1c9a00b003d972646a7dso10071434wms.5 for <60690 <at> debbugs.gnu.org>; Mon, 09 Jan 2023 12:30:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:message-id:in-reply-to :user-agent:references:date:subject:cc:to:from:from:to:cc:subject :date:message-id:reply-to; bh=cV2B3Fs1/6O7Uy4CRU0uLA7GJTkWAyuh3z7yR6u80Fo=; b=qSlXd4lPgqMgxz/j7/CDvaHkrFYYkMkm3p8SA+4hh5ytxFcoz5lLUroCWluyrFyHiR 3jd2E+0wyXpLpa4oqFbh68YQfq7Dd/hnwvWO3nVQuj59KkJBShp1YAPLaFFmdamaSteJ fJdxcOfPL8gxdfv79TQR+K+bmPUEenHj/3W+/I+T5Wpf7dHB84+1LTHGFXVgNCRh0cXm s1uQTT6COLJKBTFc+IU2Y7wXfPt4hbqJrqyd2ZvSILgc+9iUesbeM+4tEI7YcdEmkBf2 uLLZj3yQ853BfzBCRxhupcGMMi6cyeqrDdUwIXNBUC6H6VjNWUvKPOYT2DWanOqcq9m1 VCRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:in-reply-to :user-agent:references:date:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=cV2B3Fs1/6O7Uy4CRU0uLA7GJTkWAyuh3z7yR6u80Fo=; b=CPTuR9rm2y3g/EAyKjAfQvrOsheO152mVRTIb66+PtPcdIUc7bV4AqYFMrXjtjpImZ WcMD9GyD+5hnF9N2vWvvbaHc4i11DB/ChrtnHs46i9BfW0fTNhEXQ5E6FahO6XmuoeSs 773KuYMh0GU3PkiyGPkBH9HIKRJEeQpykBKxD4cz1g6doDwyxWnP+wU96lb3vBcvGZqC 
5P7htrffoWZou70aSbZiHGWtPWN2I2xubiI/wYxiCUmdto46oRnBqkbZlTblRIFsrn6C f2YN0sDZfxyQdoR9jc+zMd4kC+q9hPG0YDPvEiYo0cOX6/v6LMTKveAW9SBl2cP1ZYE0 Mqng== X-Gm-Message-State: AFqh2kqhYQeOPJypuelxcpZfDCHSvP4qZ9QOKPzc6/M12Y8SX2bWx7cJ h/tm4QqNKNY06CzDzKn8Qew= X-Google-Smtp-Source: AMrXdXv7V17hVWxQ6hsPGnebCsQ7ke4Qam3ZC3PynrU/i5qecInH4Lv4l/qSArqHGmk4skrHGzy8aQ== X-Received: by 2002:a05:600c:3485:b0:3d0:761b:f86 with SMTP id a5-20020a05600c348500b003d0761b0f86mr47038283wmq.28.1673296238834; Mon, 09 Jan 2023 12:30:38 -0800 (PST) Received: from gmgdl (j84076.upc-j.chello.nl. [24.132.84.76]) by smtp.gmail.com with ESMTPSA id p21-20020a7bcc95000000b003c65c9a36dfsm12181465wma.48.2023.01.09.12.30.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 09 Jan 2023 12:30:38 -0800 (PST) Received: from avar by gmgdl with local (Exim 4.96) (envelope-from <avarab@HIDDEN>) id 1pEymz-000Cpl-2S; Mon, 09 Jan 2023 21:30:37 +0100 From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@HIDDEN> To: Paul Eggert <eggert@HIDDEN> Subject: Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P Date: Mon, 09 Jan 2023 20:51:00 +0100 References: <20230108062335.72114-1-carenas@HIDDEN> <20230108155217.2817-1-carenas@HIDDEN> <230109.86v8lf297g.gmgdl@HIDDEN> <d6814350-10a3-55c0-68da-7e691976cd45@HIDDEN> User-agent: Debian GNU/Linux bookworm/sid; Emacs 28.2; mu4e 1.9.0 In-reply-to: <d6814350-10a3-55c0-68da-7e691976cd45@HIDDEN> Message-ID: <230109.865ydf1mdu.gmgdl@HIDDEN> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 60690 Cc: demerphq@HIDDEN, 60690 <at> debbugs.gnu.org, Carlo Marcelo Arenas =?utf-8?Q?Bel=C3=B3n?= <carenas@HIDDEN>, pcre-dev@HIDDEN, gitster@HIDDEN, git@HIDDEN X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: 
<https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -1.0 (-) On Mon, Jan 09 2023, Paul Eggert wrote: > On 1/9/23 03:35, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote: > >> You almost never want "everything Unicode considers a digit", and if you >> do using e.g. \p{Nd} instead of \d would be better in terms of >> expressing your intent. > > For GNU grep, PCRE2_UCP is needed because of examples like what > Gro-Tsen and Karl Petterssen supplied. [For reference, referring to this Twitter thread: https://twitter.com/gro_tsen/status/1610972356972875777] Those examples compared -E and -P. I think it's correct that UCP brings the behavior closer to -E, but it's also different in various ways. E.g. on emacs.git (which I've been finding to be quite a nice test case) a comparison of the two, with "git grep" because I found it easier to test, but GNU grep will presumably find the same for those files: =09 for c in b s w do for pfx in '' '(*UCP)' do echo "$pfx/$c:" && diff -u <(git -P grep -E "\\$c") <(git -P grep -P "$pfx\\$c") | wc -l done done Yields: /b: 155781 (*UCP)/b: 46035 /s: 0 (*UCP)/s: 0 /w: 142468 (*UCP)/w: 9706 So the output still differs, and some of those differences may or may not be wanted. > If there's some diagreement > about how \d should behave with UTF-8 data the GNU grep hackers should > let the Perl community decide that; that is, GNU grep can simply > follow PCRE2's lead. 
PCRE2 tends to follow Perl; I'm mainly trying to point out here that it
isn't a priori clear how "let Perl decide" is supposed to map to the
behavior of a "grep"-like utility, since the Perl behavior is inherently
tied up with knowing the encoding of the target data.

For GNU grep and "git grep" that's more of an all-or-nothing with
locales, although in this case being as close as possible to -E is
probably more correct than not.

>> $ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
>> 53360a53361,53362
>> > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
>> > git-gui/po/ja.po:"- 第２行: 空白\n"
>
> Although I don't speak Japanese I have dealt with quite a bit of
> Japanese text in a previous job, and personally I would prefer \d to
> match those two lines as they do contain digits. So to me this
> particular case is not a good argument that git grep should not match
> those lines.

I'm mainly raising the backwards compatibility concern, which GNU grep
and git grep may or may not want to handle differently, but let's at
least be aware of the various edge cases.

For \b I think it mostly does the right thing.

For \w and \d in particular I'm mainly noting that yes, sometimes you
want to match [0-9], and sometimes you'd want to match Japanese numbers,
but you rarely (or at least I haven't) want to match everything Unicode
considers X, unless you're doing some self-reflection on Unicode itself.

E.g. for \d it's at least (up from just 10):

	$ perl -CO -wE 'for (1..2**20) { say chr if chr =~ /\d/ }'|wc -l
	650

For \w you similarly go from ~60 to ~130k:

	$ perl -CO -wE 'for (1..2**24) { say chr if chr =~ /\w/ }'|wc -l
	134564

If all you're doing is matching either ASCII or Japanese text and you
want "locale-aware numbers" it might do the wrong thing.
But I've found it to be too promiscuous when casting a wider net, which
is the usual use-case with "grep".

> Of course other people might prefer differently, and there are cases
> where I want to match only ASCII digits. I've learned in the past to
> use [0-9] for that. I hope PCRE2 never changes [0-9] to match anything
> but ASCII digits when searching UTF-8 text.

I think that'll never change.
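For anyone who wants to reproduce the scale of that jump without Perl, here is a rough Python analogue. It uses Python's standard `re` module, not PCRE2 or Perl, so it only illustrates the same ASCII-versus-Unicode split; the Unicode totals depend on the Unicode tables bundled with the interpreter and need not equal the 650 and 134564 shown above. The helper name `count_matches` is made up for this sketch.

```python
import re
import sys

# One string holding every code point, so each class is counted in a single pass.
ALL_CODE_POINTS = "".join(map(chr, range(sys.maxunicode + 1)))

def count_matches(pattern: str, flags: int = 0) -> int:
    """Count the code points matched by a single-character class (hypothetical helper)."""
    return len(re.findall(pattern, ALL_CODE_POINTS, flags))

ascii_digits = count_matches(r"\d", re.ASCII)   # only 0-9
unicode_digits = count_matches(r"\d")           # every Unicode decimal digit
ascii_word = count_matches(r"\w", re.ASCII)     # only A-Z, a-z, 0-9 and _
unicode_word = count_matches(r"\w")             # Unicode letters, digits and _

print(ascii_digits, unicode_digits)
print(ascii_word, unicode_word)
```

The ASCII counts are fixed by definition (10 and 63); the Unicode counts are whatever the local interpreter's tables say, which is rather the point being made about casting a wider net.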
bug-grep@HIDDEN
:bug#60690
; Package grep
.
Full text available.Received: (at 60690) by debbugs.gnu.org; 9 Jan 2023 18:40:31 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jan 09 13:40:31 2023 Received: from localhost ([127.0.0.1]:38164 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1pEx4R-0005tj-BU for submit <at> debbugs.gnu.org; Mon, 09 Jan 2023 13:40:31 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:47864) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <eggert@HIDDEN>) id 1pEx4O-0005tR-1L for 60690 <at> debbugs.gnu.org; Mon, 09 Jan 2023 13:40:29 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 0D99E160043; Mon, 9 Jan 2023 10:40:21 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id ilMecndlWFjy; Mon, 9 Jan 2023 10:40:20 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 38244160048; Mon, 9 Jan 2023 10:40:20 -0800 (PST) DKIM-Filter: OpenDKIM Filter v2.9.2 zimbra.cs.ucla.edu 38244160048 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cs.ucla.edu; s=78364E5A-2AF3-11ED-87FA-8298ECA2D365; t=1673289620; bh=sPewYtI2nslfvK5V0TkcNuGYU9M7ggzvuzpTsLpmiW8=; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type: Content-Transfer-Encoding; b=bEFcX4X9DsSGEwgVJ1w43dcGCKwvpVv0j3FPVy7UOcbDOchs1ajZ7UzGGfznQi0uF DJSbUmvaEVtzzS3N03J+B+qArGHZf5EaPiPO3EtgdQSFtIcI3hgKe5HJmEwwG2DSTj XoUm7xyk5hGPN5K7r32q1LbpVvFNfyWaIVE8LxIA= X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 5dMvSc-VXdfb; Mon, 9 Jan 2023 10:40:20 -0800 (PST) Received: from [131.179.64.200] (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 17FEF160043; Mon, 9 Jan 2023 
10:40:20 -0800 (PST) Message-ID: <d6814350-10a3-55c0-68da-7e691976cd45@HIDDEN> Date: Mon, 9 Jan 2023 10:40:16 -0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0 Subject: Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P Content-Language: en-US To: =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= <avarab@HIDDEN>, =?UTF-8?Q?Carlo_Marcelo_Arenas_Bel=c3=b3n?= <carenas@HIDDEN> References: <20230108062335.72114-1-carenas@HIDDEN> <20230108155217.2817-1-carenas@HIDDEN> <230109.86v8lf297g.gmgdl@HIDDEN> From: Paul Eggert <eggert@HIDDEN> Organization: UCLA Computer Science Department In-Reply-To: <230109.86v8lf297g.gmgdl@HIDDEN> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Spam-Score: -3.4 (---) X-Debbugs-Envelope-To: 60690 Cc: demerphq@HIDDEN, pcre-dev@HIDDEN, 60690 <at> debbugs.gnu.org, git@HIDDEN, gitster@HIDDEN X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -4.4 (----)

On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:

> You almost never want "everything Unicode considers a digit", and if you
> do using e.g.
\p{Nd} instead of \d would be better in terms of
> expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen
and Karl Petterssen supplied. If there's some disagreement about how \d
should behave with UTF-8 data the GNU grep hackers should let the Perl
community decide that; that is, GNU grep can simply follow PCRE2's lead.
But GNU grep does need PCRE2_UCP for \b etc.

> $ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
> 53360a53361,53362
> > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> > git-gui/po/ja.po:"- 第２行: 空白\n"

Although I don't speak Japanese I have dealt with quite a bit of
Japanese text in a previous job, and personally I would prefer \d to
match those two lines as they do contain digits. So to me this
particular case is not a good argument that git grep should not match
those lines.

Of course other people might prefer differently, and there are cases
where I want to match only ASCII digits. I've learned in the past to use
[0-9] for that. I hope PCRE2 never changes [0-9] to match anything but
ASCII digits when searching UTF-8 text.
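The three ways of saying "digit" that come up in this message can be compared side by side on the first ja.po line. The sketch below uses Python's standard `re` module as a stand-in, so it is an analogy rather than a statement about PCRE2 or grep. Python's `re` has no `\p{Nd}`, so a check against `unicodedata.category` is used in its place; that substitution is an assumption of this example.

```python
import re
import unicodedata

# First line from git-gui/po/ja.po as quoted above (without the trailing \n).
line = "- 第１行: 何をしたか、を１行で要約。"

# Explicit ASCII range: never matches the fullwidth digit U+FF11.
ascii_only = re.findall(r"[0-9]", line)

# \d on a str pattern uses Unicode semantics in Python 3.
unicode_d = re.findall(r"\d", line)

# Stand-in for \p{Nd}: keep characters whose general category is Nd.
nd_chars = [ch for ch in line if unicodedata.category(ch) == "Nd"]

print(ascii_only)
print(unicode_d)
print(nd_chars)
```

Here `[0-9]` finds nothing, while the Unicode-aware `\d` and the Nd category check both pick out the two fullwidth `１` characters, which is the split the two positions in this thread are arguing over.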
Full text available.Received: (at submit) by debbugs.gnu.org; 9 Jan 2023 12:18:14 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jan 09 07:18:14 2023 Received: from localhost ([127.0.0.1]:35929 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1pEr6U-0005P3-5w for submit <at> debbugs.gnu.org; Mon, 09 Jan 2023 07:18:14 -0500 Received: from lists.gnu.org ([209.51.188.17]:53372) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <avarab@HIDDEN>) id 1pEr6R-0005Ou-83 for submit <at> debbugs.gnu.org; Mon, 09 Jan 2023 07:18:12 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from <avarab@HIDDEN>) id 1pEr69-0005tl-00 for bug-grep@HIDDEN; Mon, 09 Jan 2023 07:18:05 -0500 Received: from mail-wr1-x42d.google.com ([2a00:1450:4864:20::42d]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from <avarab@HIDDEN>) id 1pEr61-00030N-4F for bug-grep@HIDDEN; Mon, 09 Jan 2023 07:17:52 -0500 Received: by mail-wr1-x42d.google.com with SMTP id co23so7958266wrb.4 for <bug-grep@HIDDEN>; Mon, 09 Jan 2023 04:17:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:message-id:in-reply-to :user-agent:references:date:subject:cc:to:from:from:to:cc:subject :date:message-id:reply-to; bh=6FM4ode4z9XvpMbkwv8gUrDG/eDbYEnqHfmVDyiMfMs=; b=G8sZwJ/QhrGdvyFyUuC0bpLb3Tmz6xNsInXJh0GGNnkfkYM65mqOes0L+zm16fSJxe OU9bMKDTWMuoQ/qM1xS5FzXr+vPmKdYCGhu2qpP2KmaNPv8dfjICXg0rDQJ4H0JE9dso z1GjkmEjSf7ErO5nIWeDJCA26Za12e6V3k1C3CeXcmnTW6It69sT3wH9OSCGD32YONnb qZoOd4PvmvPK1xy3H6uuodcGzA8maPe5oR8LMEOZgMy0svfwxN4P+2MOYGliSSusMCsZ IFpbxK1WENXAAto4xdsHUk2UdZ8jARgYNeHNeJmI0r0MbfgJsUH3PVNuZuDsLgsLAaIR 8FUw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=content-transfer-encoding:mime-version:message-id:in-reply-to :user-agent:references:date:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=6FM4ode4z9XvpMbkwv8gUrDG/eDbYEnqHfmVDyiMfMs=; b=Ca6bvZOsjCshM71D+WbueQyE/Tw+XdHROBHOLF7MHypnc/V0Kmwwd76AnJj+GjvTER +NcbIoK6khMNSWPv+NEXLBUfAX/pGXEIKboB9OOrC+jyWGUakaOU4D7q4ygYiyNoO6+p bK4ElI9D+wN9IqUP2+sKNbm34uFaE6WMJRtdtdSOfkYPjH8TEEcxLfCtp3FdUTQkSmyg z743dGuuxsc9Gtwc2Ui/o2Urpnow8KXs00nToNgpxrBRABoXhJQ8/rgx/y6hvy5ysSlJ dKGkRtOu8jvoEr/Fa6TlYXj/LYBJHU0QL66hSIkGTOCyNZ8jlg5DIHTtshABHLj0muUY acNQ== X-Gm-Message-State: AFqh2kpoMDiSdYWfKSkSBQibS9eoyeKc19ye36LOGAUsYx+niCmLwqLr zXw+S5eoBp54BRtnWn+gUuo= X-Google-Smtp-Source: AMrXdXuxydozZmGlzg24dWp9mmZ1rdkRLma87wEuyDeIia3LVN0uz1XuTBmCq4sM/JrFiqU8SPVPbg== X-Received: by 2002:a5d:5d10:0:b0:242:5b1f:3dcf with SMTP id ch16-20020a5d5d10000000b002425b1f3dcfmr55685723wrb.63.1673266661060; Mon, 09 Jan 2023 04:17:41 -0800 (PST) Received: from gmgdl (j84076.upc-j.chello.nl. [24.132.84.76]) by smtp.gmail.com with ESMTPSA id w5-20020a05600018c500b002420dba6447sm8395393wrq.59.2023.01.09.04.17.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 09 Jan 2023 04:17:40 -0800 (PST) Received: from avar by gmgdl with local (Exim 4.96) (envelope-from <avarab@HIDDEN>) id 1pEr5v-0003T0-2o; Mon, 09 Jan 2023 13:17:39 +0100 From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@HIDDEN> To: Carlo Marcelo Arenas =?utf-8?Q?Bel=C3=B3n?= <carenas@HIDDEN> Subject: Re: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P Date: Mon, 09 Jan 2023 12:35:05 +0100 References: <20230108062335.72114-1-carenas@HIDDEN> <20230108155217.2817-1-carenas@HIDDEN> User-agent: Debian GNU/Linux bookworm/sid; Emacs 28.2; mu4e 1.9.0 In-reply-to: <20230108155217.2817-1-carenas@HIDDEN> Message-ID: <230109.86v8lf297g.gmgdl@HIDDEN> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Received-SPF: pass 
client-ip=2a00:1450:4864:20::42d; envelope-from=avarab@HIDDEN; helo=mail-wr1-x42d.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Debbugs-Envelope-To: submit Cc: demerphq <demerphq@HIDDEN>, pcre-dev@HIDDEN, bug-grep@HIDDEN, gitster@HIDDEN, git@HIDDEN X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -2.3 (--)

On Sun, Jan 08 2023, Carlo Marcelo Arenas Belón wrote:

> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
>
> This prevents extending the meaning of the character classes to
> include those new valid characters and therefore result in failed
> matches for expressions that rely on that extension, for ex:
>
>   $ git grep -P '\bÆvar'
>
> Add PCRE2_UCP so that \w will include Æ and therefore \b could
> correctly match the beginning of that word.
>
> This has an impact on performance that has been estimated to be
> between 20% and 40% and that is shown through the added performance
> test.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@HIDDEN>
> ---
>  grep.c                              |  2 +-
>  t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+), 1 deletion(-)
>  create mode 100755 t/perf/p7822-grep-perl-character.sh
>
> diff --git a/grep.c b/grep.c
> index 06eed69493..1687f65b64 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		options |= PCRE2_CASELESS;
>  	}
>  	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> -		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);

I have a definite bias towards liking this change, it would help me find
myself :)

But I don't think it's safe to change the default behavior of "git grep";
it's not a mere bug fix, but a major behavior change for existing users
of grep.patternType=perl. E.g. on git.git:

	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
	53360a53361,53362
	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
	> git-gui/po/ja.po:"- 第２行: 空白\n"

So, it will help "do the right thing" on e.g. "\bÆ", but it will also
find e.g. CJK numeric characters for \d etc.

I see per the discussion on
https://github.com/PCRE2Project/pcre2/issues/185 and
https://lists.gnu.org/archive/html/bug-grep/2023-01/threads.html that
you submitted similar fixes to GNU grep & PCRE itself.
I see that GNU grep integrated it a couple of days ago as
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3

As most discussions about PCRE will eventually devolve into "what does
Perl do?": "Perl" itself will promiscuously use this behavior by
default. E.g. here the same "１" character (not the ASCII digit "1")
will be matched from the command-line:

	$ perl -Mre=debug -CA -wE 'shift =~ /\d/' "１"
	Compiling REx "\d"
	Final program:
	   1: POSIXU[\d] (2)
	   2: END (0)
	stclass POSIXU[\d] minlen 1
	Matching REx "\d" against "%x{ff11}"
	UTF-8 string...
	Matching stclass POSIXU[\d] against "%x{ff11}" (3 bytes)
	   0 <> <%x{ff11}>         |   0| 1:POSIXU[\d](2)
	   3 <%x{ff11}> <>         |   0| 2:END(0)
	Match successful!
	Freeing REx: "\d"

But I don't think it makes sense for "git grep" (or GNU "grep") to
follow Perl in this particular case.

For those not familiar with its Unicode model: it doesn't assume by
default that strings are Unicode; they have to be explicitly marked as
such. In the above example I'm declaring that all of "argv" is UTF-8
(via the "-CA" flag). If I didn't supply that flag the string wouldn't
have the UTF-8 flag, and wouldn't match, as the Perl regex engine won't
use Unicode semantics except on Unicode target strings.

Even for Perl, this behavior has been troublesome. Opinions differ, but
I think many would agree (and I've CC'd the main authority on Perl's
regex engine) that doing this by default was *probably* a mistake.

You almost never want "everything Unicode considers a digit", and if you
do using e.g. \p{Nd} instead of \d would be better in terms of
expressing your intent.

I see you're running into this on the PCRE tracker, where you're
suggesting that the equivalent of /a (or /aa) would be needed.
https://github.com/PCRE2Project/pcre2/issues/185#issuecomment-1374796393

Which brings me home to the seeming digression about "Perl" above.
Unlike a programming language where you'll typically "mark" your data as
it comes in, natural text as UTF-8, binary data as such etc., a "grep"
utility has to operate on more of an "all or nothing" basis (except in
the case of "-a"). I.e. we're usually searching through unknown data.

Enabling this by default means that we'll pick up characters most people
probably wouldn't expect, particularly from near-binary data formats
(those that won't require "-a", but contain non-Unicode non-ASCII
sequences).

I don't have some completely holistic view of what we should do in every
case, e.g. we turned on PCRE2_UTF so that things like "-i" would Just
Work, but even case-insensitivity has its own unexpected edge cases in
Unicode.

But I don't think those edge cases are nearly as common as those we'd
run into by enabling PCRE2_UCP. Rather than trying to opt-out with "/a"
or "/aa" I think this should be opt-in.

As the example at the start shows you can already do this with "(*UCP)"
in the pattern, so perhaps we should just link to the pcre2pattern(3)
manual from git-grep(1)?
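As a small illustration of the opt-in versus opt-out question, the sketch below shows the same trade-off in Python's standard `re` module. This is an analogy only: Python is not PCRE2, its default for `str` patterns is already Unicode-aware, and the inline `(?a)` / `(?u)` flags are merely similar in spirit to `(*UCP)` or Perl's `/a`. The sample string `text` is made up.

```python
import re

text = "patch by Ævar"  # made-up sample input

# Unicode semantics: Æ counts as a word character, so \b matches
# between the space and Æ and the search succeeds.
unicode_hit = re.search(r"\bÆvar", text)

# ASCII semantics: Æ is not in [A-Za-z0-9_], so the space and Æ are both
# non-word characters, there is no boundary between them, and it fails.
ascii_hit = re.search(r"\bÆvar", text, re.ASCII)

# The mode can also be selected from inside the pattern itself.
inline_ascii = re.search(r"(?a)\d", "１")    # ASCII-only \d: no match
inline_unicode = re.search(r"(?u)\d", "１")  # Unicode \d: matches U+FF11

print(bool(unicode_hit), bool(ascii_hit))
print(bool(inline_ascii), bool(inline_unicode))
```

Whichever mode is the default, the other remains reachable from the pattern; the disagreement in this thread is only about which of the two a "grep"-like tool should pick when the user says nothing.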
Ævar Arnfjörð Bjarmason <avarab@HIDDEN>
:bug-grep@HIDDEN
.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.