GNU bug report logs - #64128
regexp parser zero-width assertion bugs

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Sat, 17 Jun 2023 12:21:02 UTC

Severity: normal

To reply to this bug, email your comments to 64128 AT debbugs.gnu.org.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#64128; Package emacs. (Sat, 17 Jun 2023 12:21:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 17 Jun 2023 12:21:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200

In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an element by an immediately following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:

  xy\b*

is parsed as, in rx syntax, (* "xy" word-boundary) which is remarkable: the repetition operator encompasses several elements even though there are no brackets given. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
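
A minimal sketch of how the two plausible readings can be spelled out unambiguously, so that the result does not depend on how a given Emacs version parses `\b*`. The helper name `my-match-span` is hypothetical, and the word-boundary results assume a syntax table in which letters are word constituents while `,` and `*` are not.

```elisp
;; Hypothetical helper, not part of Emacs: return (START END) of the
;; first match of REGEXP in STRING, or nil if there is no match.
(defun my-match-span (regexp string)
  (save-match-data
    (when (string-match regexp string)
      (list (match-beginning 0) (match-end 0)))))

;; Reading 1: repeat the whole sequence.  The explicit shy group says so.
(my-match-span "\\(?:quack,\\b\\)*" "quack,quack,quack,quaaaack!")
;; => (0 18)

;; Reading 2: a literal asterisk after the assertion.  The backslash says so.
(my-match-span "quack\\b\\*" "quack*")
;; => (0 6)
```

Writing the group or the backslash explicitly leaves nothing for the parser to guess, whichever way the unbracketed form ends up being interpreted.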

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)
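
The classification above can be probed in a running Emacs. The following is only a sketch: all names are hypothetical, and it compares each assertion followed by `*` against the two explicit spellings on a handful of sample strings, so agreement is evidence rather than proof. The answers depend on the Emacs version, which is why no expected results are listed.

```elisp
;; Hypothetical probe, not part of Emacs.  All names below are made up.

(defconst my-probe-subjects '("" "*" "**" "a*" "a **" "$$*")
  "Strings on which the candidate regexps are compared.")

(defun my-probe-spans (regexp)
  "Return the (START END) match span of REGEXP on each probe subject."
  (save-match-data
    (mapcar (lambda (subject)
              (when (string-match regexp subject)
                (list (match-beginning 0) (match-end 0))))
            my-probe-subjects)))

(defun my-classify-star (assertion)
  "Classify how ASSERTION followed by a star behaves in this Emacs.
Return `literal' if it matches like ASSERTION plus an escaped star,
`operator' if it matches like a shy group around ASSERTION with a
star applied to it, and `other' if it matches like neither."
  (let ((written  (my-probe-spans (concat assertion "*")))
        (literal  (my-probe-spans (concat assertion "\\*")))
        (operator (my-probe-spans (concat "\\(?:" assertion "\\)*"))))
    (cond ((equal written literal) 'literal)
          ((equal written operator) 'operator)
          (t 'other))))

;; Classify every assertion from the two lists above.
(mapcar (lambda (a) (cons a (my-classify-star a)))
        '("^" "$" "\\`" "\\'" "\\b" "\\B"
          "\\<" "\\>" "\\_<" "\\_>" "\\="))
```

The `other` result is there because `^` and `$` are only anchors in certain positions, so a spelling such as `$*` may match neither explicit form.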

These regexp patterns should be very rare in practice: they should always be a mistake, but it would be nice if they behaved in a way that makes some kind of sense.

A modest improvement would be to make operators become literal after any zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
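
One way to obtain unambiguous string forms of the proposed readings is to let `rx` generate them, since `rx` quotes a literal "*" itself. This is only an illustration of the intended meaning, not part of the suggested patch.

```elisp
(require 'rx)

;; (: word-start "*"): the assertion followed by a literal asterisk.
(rx word-start "*")
;; => "\\<\\*"

;; (: "xy" word-boundary "*"): likewise, with a preceding literal string.
(rx "xy" word-boundary "*")
;; => "xy\\b\\*"
```

Both results escape the asterisk, so they mean the same thing whether or not the parser treats an unescaped `*` after the assertion as literal.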

Suggested patch attached.

[regexp-zero-width-assertion-bug.diff (application/octet-stream, attachment)]

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:44:30 -0400

> (and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
>      (match-data))
> => (0 18)

That's so bizarre that it feels like we really should try and preserve
it for posterity.
Not.

> These regexp patterns should be very rare in practice: they should
> always be a mistake, but it would be nice if they behaved in a way
> that makes some kind of sense.
>
> A modest improvement would be to make operators become literal after
> any zero-width assertion, so that

I think the behavior that makes most sense is to signal an error when
compiling the regexp.


        Stefan

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 22:07:56 +0200

On 17 June 2023 at 20:44, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:

> I think the behavior that makes most sense is to signal an error when
> compiling the regexp.

Clearly, but some behaviour needs to be preserved for compatibility.
Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?

Message #14 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 64128 <at> debbugs.gnu.org
Subject: Re: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 15:18:00 -0700

On 2023-06-17 13:07, Mattias Engdegård wrote:
> On 17 June 2023 at 20:44, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:
> 
>> I think the behavior that makes most sense is to signal an error when
>> compiling the regexp.
> 
> Clearly, but some behaviour needs to be preserved for compatibility.
> Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> 

doc/lispref/searching.texi says that "*" is treated as an ordinary 
character if it is in a context where its special meaning makes no 
sense, giving "*foo" as an example. If we break with this tradition by 
making "\b*" an error instead of being equivalent to "\b\*", we should 
update that part of the manual.

One possible way forward is to update doc/lispref/searching.texi to 
specify what we want. Then we can modify the code to match the updated 
documentation.

In my experience, modifying the doc is often the hard part, so I took a 
crack at that in the draft proposed patch, which I have not installed.

Comments?

[0001-Document-that-b-etc-are-now-invalid-regexps.patch (text/x-patch, attachment)]

Message #17 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: mattias.engdegard <at> gmail.com, monnier <at> iro.umontreal.ca,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Sun, 18 Jun 2023 07:55:01 +0300

> Cc: 64128 <at> debbugs.gnu.org
> Date: Sat, 17 Jun 2023 15:18:00 -0700
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> 
> > Clearly, but some behaviour needs to be preserved for compatibility.
> > Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> > 
> 
> doc/lispref/searching.texi says that "*" is treated as an ordinary 
> character if it is in a context where its special meaning makes no 
> sense, giving "*foo" as an example. If we break with this tradition by 
> making "\b*" an error instead of being equivalent to "\b\*", we should 
> update that part of the manual.
> 
> One possible way forward is to update doc/lispref/searching.texi to 
> specify what we want. Then we can modify the code to match the updated 
> documentation.
> 
> In my experience, modifying the doc is often the hard part, so I took a 
> crack at that in the draft proposed patch, which I have not installed.
> 
> Comments?

My comment is that since this was a documented feature, I'm not
interested in making it an error.

Message #20 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, monnier <at> iro.umontreal.ca,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Sun, 18 Jun 2023 22:26:28 +0200

On 18 June 2023 at 06:55, Eli Zaretskii <eliz <at> gnu.org> wrote:

> My comment is that since this was a documented feature, I'm not
> interested in making it an error.

Yes, it would be unwise to raise an error for "^*" or the like; it's in active use.
The manual is a bit hazy about what we actually promise, though.

As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?

To make everything clear, we have two groups of zero-width assertions:

Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=

Group B assertions work like ordinary elements, syntactically and semantically. Simple, predictable, but also useless.

Group A assertions are more interesting: either there is nothing before a train of such assertions, such as

   "^\\`\\b\\`*?"

which turns the first character of the operator into a literal (and a second character, if present, now becomes an operator acting on that literal).
Or there is something before them, in which case the operator acts on the last element preceding the assertions, except that multiple literal characters coalesce into a single element. The exception to that is an out-of-place `^` among the literal characters, which splits the sequence of literals into separate segments, though not exactly where you would expect.
For example,

  "abc^def\\B\\B+?"

means, I think,

  (seq "ab" (+? "c^def" not-word-boundary not-word-boundary))

Message #23 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Sun, 18 Jun 2023 23:04:49 -0400

> To make everything clear, we have two groups of zero-width assertions:
>
> Group A: ^ $ \` \' \b \B

IIRC `^` is only special if it's at the beginning of a group, so `^*` will
always treat this * as a literal, right?
"Similarly" `$` is only special if it's at the end of a group, so `$*` will
always be a repetition of the $ character no?
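
The positional rule for `^` and `$` can be checked directly. A small sketch, assuming only the documented rule that `^` is an anchor at the start of a regexp (or after `\(`, `\(?:` or `\|`), that `$` is an anchor at the end (or before `\)` or `\|`), and that both are ordinary characters elsewhere:

```elisp
;; `^' and `$' in the middle of a regexp are ordinary characters.
(string-match-p "a^b" "a^b")   ; => 0
(string-match-p "a$b" "a$b")   ; => 0

;; At the start and at the end of a regexp they are anchors.
(string-match-p "^b" "ab")     ; => nil
(string-match-p "a$" "ab")     ; => nil
```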

So the remaining problematic elements are \` \' \b and \B

I suspect if we don't want to signal errors, the next best thing is to
treat them like group B.


        Stefan

Message #26 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 10:44:04 +0200

On 19 June 2023 at 05:04, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:

> `^` is only special if it's at the beginning of a group, so `^*` will
> always treat this * as a literal, right?
> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
> always be a repetition of the $ character no?

Yes, ^ and $ have additional rules for when they are plain literals and not subject to these bugs at all.

The literal-splitting powers of ^ have now (075e77ac44) been removed.

> So the remaining problematic elements are \` \' \b and \B

\`* has been observed, so we probably need to keep that working as well.

> I suspect if we don't want to signal errors, the next best thing is to
> treat them like group B.

Yes, maybe; they are less likely to be followed by an operator-literal, but it would also be good to have all zero-width assertions work the same way.
On the other hand, it can't be worse than we have now, as long as we get rid of the "quack,\\b*" semantics.

Message #29 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 08:54:22 -0400

I wish there was a way to emit warnings about oddball constructs
(starting with the "* is literal when encountered at the beginning of
a regexp").


        Stefan


Mattias Engdegård [2023-06-19 10:44:04] wrote:

> On 19 June 2023 at 05:04, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:
>
>> `^` is only special if it's at the beginning of a group, so `^*` will
>> always treat this * as a literal, right?
>> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
>> always be a repetition of the $ character no?
>
> Yes, ^ and $ have additional rules for when they are plain literals and not
> subject to these bugs at all.
>
> The literal-splitting powers of ^ have now (075e77ac44) been removed.
>
>> So the remaining problematic elements are \` \' \b and \B
>
> \`* has been observed, so we probably need to keep that working as well.
>
>> I suspect if we don't want to signal errors, the next best thing is to
>> treat them like group B.
>
> Yes, maybe; they are less likely to be followed by an operator-literal, but
> it would also be good to have all zero-width assertions work the same way.
> On the other hand, it can't be worse than we have now, as long as we get rid
> of the "quack,\\b*" semantics.

Message #32 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 11:14:18 -0700

On 2023-06-18 13:26, Mattias Engdegård wrote:
> The manual is a bit hazy about what we actually promise, though.
> 
> As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?

Although it's not easy to document, we should do better. I gave that a 
shot by installing the attached patches into the master branch. These 
patches try to document current behavior, including warning about the 
squirrelly behavior you mention. If/when we fix the squirrelly behavior 
we can change that part of the manual accordingly.

The last of the three patches is merely a terminology change: it 
standardizes on the term "bracket expression" for regexps like [a-z]. 
Formerly the doc and comments were inconsistent about the terminology. 
It's better to stick with the POSIX term here, to avoid confusion. I 
myself got confused about this when editing the other two patches.

Comments welcome as usual.

[0001-Document-regular-expression-special-cases-better.patch (text/x-patch, attachment)]

[0002-Document-Emacs-vs-POSIX-REs.patch (text/x-patch, attachment)]

[0003-Call-them-bracket-expressions-more-consistently.patch (text/x-patch, attachment)]

Message #35 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 20:34:42 +0200

On 19 June 2023 at 14:54, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:
> 
> I wish there was a way to emit warnings about oddball constructs
> (starting with the "* is literal when encountered at the beginning of
> a regexp").

I agree, but I'm more of a static analysis man. (And relint does complain about all these cases as long as the regexp is detected as such, so there probably aren't many of them left in the Emacs tree.)

Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
Maybe we should change group B assertions so that they work in the same way.

[regexp-zero-width-assertion-noquack.diff (application/octet-stream, attachment)]

Message #38 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 12:21:50 -0700

On 2023-06-19 11:34, Mattias Engdegård wrote:
> Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
> Maybe we should change group B assertions so that they work in the same way.

> -     operand.  Reset at the beginning of groups and alternatives.  */
> +     operand.  Reset at the beginning of groups and alternatives,
> +     and after zero-width assertions which should not be the target
> +     of any postfix repetition operators.  */

If I understand things correctly, this would cause "\b*c" to be treated 
like "\b\*c". If so, it's headed in the wrong direction.

It's long been documented that the only reason "*" is ordinary at the 
start of a regular expression or subexpression is "historical 
compatibility", and it's also long been documented that you shouldn't 
take advantage of this and you should backslash-escape the "*" anyway. 
In contrast, for constructs like \b* there is not a historical 
compatibility reason, so there's not a good argument for treating "*" as 
an ordinary character after "\b".

Instead, \b should not be a special case before "*", and \b* should be 
equivalent to \(\b\)* and should match only the empty string. Similarly 
for the other zero-width backslash escapes. This is what I would expect 
from these constructs from the longstanding documentation.
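
The explicitly bracketed form already behaves this way, which gives a reference point for what \b* would mean under this reading. A sketch with a hypothetical helper `my-match-span`, assuming a syntax table in which letters are word constituents:

```elisp
;; Hypothetical helper, not part of Emacs: return (START END) of the
;; first match of REGEXP in STRING, or nil if there is no match.
(defun my-match-span (regexp string)
  (save-match-data
    (when (string-match regexp string)
      (list (match-beginning 0) (match-end 0)))))

;; A repeated zero-width assertion can only match the empty string.
(my-match-span "\\(?:\\b\\)*" "abc")    ; => (0 0)

;; It therefore adds nothing in front of an ordinary element.
(my-match-span "\\(?:\\b\\)*c" "abc")   ; => (2 3)
```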

If we instead added a rule to say that a construct that can only match 
the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* 
would both be equivalent to \*. Although consistent, this would be 
confusing: it would compound the historical-compatibility mistake. Let's 
keep things simple instead.

Also, whatever change we make to the behavior should be documented in 
the manual and in etc/NEWS.

Message #41 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 21:52:40 +0200

On 19 June 2023 at 21:21, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".

Actually it already works that way. What the patch does is prevent AB\b*C from being treated as \(?:AB\b\)*C; it is treated as AB\b\*C instead, which I think we can all agree is less wrong.

You can check the test cases in the patch:

  (should (equal (string-match "q\\b*!" "q*!") 0))
  (should (equal (string-match "q\\b*!" "!") nil))

which in current Emacs produce 2 and 0 respectively.

> It's long been documented that the only reason "*" is ordinary at the start of a regular expression or subexpression is "historical compatibility", and it's also long been documented that you shouldn't take advantage of this and you should backslash-escape the "*" anyway. In contrast, for constructs like \b* there is not a historical compatibility reason, so there's not a good argument for treating "*" as an ordinary character after "\b".

Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.
We also have \` to consider -- I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.

> Instead, \b should not be a special case before "*", and \b* should be equivalent to \(\b\)* and should match only the empty string. Similarly for the other zero-width backslash escapes. This is what I would expect from these constructs from the longstanding documentation.
> 
> If we instead added a rule to say that a construct that can only match the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* would both be equivalent to \*. Although consistent, this would be confusing: it would compound the historical-compatibility mistake. Let's keep things simple instead.

Yes, I definitely would be confused by such semantics.

> Also, whatever change we make to the behavior should be documented in the manual and in etc/NEWS.

Will be happy to oblige, although in this case it really just was a bug fix.

What I really would like to see is the regexp parser somehow separated from the NFA bytecode generator, which would make both clearer. The parser could then be re-used for other purposes such as a different back-end (DFA construction) or a built-in xr-like converter.

Message #44 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 16:08:57 -0400

>> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".
> Actually it already works that way. What the patch does is prevent
> AB\b*C from being treated as \(?:AB\b\)*C; it is treated as AB\b\*C
> instead, which I think we can all agree is less wrong.

Hmm... maybe it's less wrong, but I'd rather make it behave like
AB\(\b\)*C, which is, I'd argue, even less wrong.

Or maybe make it signal an error: I can't imagine that the current
behavior is used by very much code at all, seeing how it's so
seriously non-intuitive.


        Stefan

Message #47 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 13:40:06 -0700

On 2023-06-19 12:52, Mattias Engdegård wrote:

> Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.

OK, but we still need to fix this, as \b and \B should not be a special 
case for following "*".

> I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.

That makes some sense, in that \` is like ^, and ^ is already a special 
case (this is true even in POSIX BREs).

In other words, how about if we change the groups from your list:

Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=

to this:

Group A: ^ \`
Group B: $ \' \b \B \< \> \_< \_> \=

where "*" is ordinary after Group A, and special after Group B and there 
is no other squirrelly behavior. And similarly for the other repetition 
operators.

Attached is a proposed doc change for this, which I have not installed. 
Of course the code and etc/NEWS would need changing too.

[0001-Document-proposed-regex-fix-bug-64128.patch (text/x-patch, attachment)]

Message #50 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Tue, 20 Jun 2023 13:36:42 +0200

On 19 June 2023 at 22:08, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:

> Hmm... maybe it's less wrong, but I'd rather make it behave like
> AB\(\b\)*C, which is, I'd argue, even less wrong.

I agree, and you are probably right that it's safe to do that.

> Or maybe make it signal an error: I can't imagine that the current
> behavior is used by very much code at all, seeing how it's so
> seriously non-intuitive.

That might be even better if we can get away with it.

On 19 June 2023 at 22:40, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> In other words, how about if we change the groups from your list:
> 
> Group A: ^ $ \` \' \b \B
> Group B: \< \> \_< \_> \=
> 
> to this:
> 
> Group A: ^ \`
> Group B: $ \' \b \B \< \> \_< \_> \=
> 
> where "*" is ordinary after Group A, and special after Group B and there is no other squirrelly behavior. And similarly for the other repetition operators.

Sounds fine, with the option to go full error on group B if we agree that that's even better.

> Attached is a proposed doc change for this, which I have not installed.

Thank you, it has been incorporated in the attached patch which follows your suggestions above.

Your previous regexp doc updates are most appreciated. I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.

[0001-Straighten-regexp-postfix-operator-after-zero-width-.patch (application/octet-stream, attachment)]

Message #53 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Tue, 20 Jun 2023 23:08:38 -0700

On 2023-06-20 04:36, Mattias Engdegård wrote:

> Sounds fine, with the option to go full error on group B if we agree that that's even better.

That would be fine too. I'd even prefer it. In the meantime your patch 
looks good.


> I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.

Agreed.

Message #56 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Wed, 21 Jun 2023 17:57:29 +0200

On 21 June 2023 at 08:08, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

>> Sounds fine, with the option to go full error on group B if we agree that that's even better.
> 
> That would be fine too. I'd even prefer it. In the meantime your patch looks good.

Good, it's now in master. Let's think about whether an error can be motivated, and how.
We usually don't prevent the user from doing silly things, except when there is a strong reason to believe that it might be a serious mistake.

This bug report was last modified 2 years and 181 days ago.
