GNU bug report logs - #64128: regexp parser zero-width assertion bugs
To reply to this bug, email your comments to 64128 AT debbugs.gnu.org.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sat, 17 Jun 2023 12:21:02 GMT)
Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 17 Jun 2023 12:21:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an element by an immediately following ?, * or +. For example,
\b*
matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:
xy\b*
is parsed as, in rx syntax, (* "xy" word-boundary) which is remarkable: the repetition operator encompasses several elements even though there are no brackets given. Demo:
(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)
Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)
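The two lists above can be checked empirically. The following is a minimal sketch, not taken from the report: `my-probe-assertions` is a hypothetical helper name, and the sample string "a*" is an arbitrary choice. It only prints what the running Emacs does; the results differ between Emacs versions, so no expected output is given.

```elisp
;; Sketch only: `my-probe-assertions' is a hypothetical helper, not part of Emacs.
(defun my-probe-assertions (string)
  "Match each zero-width assertion followed by `*' against STRING.
Return an alist of (REGEXP . RESULT), where RESULT is the match data,
nil for no match, or the error object if the regexp was rejected."
  (mapcar
   (lambda (assertion)
     (let ((re (concat assertion "*")))
       (cons re
             (condition-case err
                 (and (string-match re string) (match-data))
               (invalid-regexp err)))))
   '("^" "$" "\\`" "\\'" "\\b" "\\B"       ; assertions from the first list
     "\\<" "\\>" "\\_<" "\\_>" "\\=")))    ; assertions from the second list

;; Print one line per assertion; output varies between Emacs versions.
(dolist (entry (my-probe-assertions "a*"))
  (message "%-6s -> %S" (car entry) (cdr entry)))
```

A match whose end position lies past its start suggests that the `*` consumed a literal asterisk; an empty match suggests that it acted as an operator on the assertion.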
These regexp patterns should be very rare in practice: they should always be a mistake, but it would be nice if they behaved in a way that makes some kind of sense.
A modest improvement would be to make operators become literal after any zero-width assertion, so that
\<*
becomes (: word-start "*") instead of (* word-start), and
xy\b*
becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
Suggested patch attached.
[regexp-zero-width-assertion-bug.diff (application/octet-stream, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sat, 17 Jun 2023 18:45:01 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
> (and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
> (match-data))
> => (0 18)
That's so bizarre that it feels like we really should try and preserve
it for posterity.
Not.
> These regexp patterns should be very rare in practice: they should
> always be a mistake, but it would be nice if they behaved in a way
> that makes some kind of sense.
>
> A modest improvement would be to make operators become literal after
> any zero-width assertion, so that
I think the behavior that makes most sense is to signal an error when
compiling the regexp.
Stefan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sat, 17 Jun 2023 20:09:01 GMT)
Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):
17 juni 2023 kl. 20.44 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> I think the behavior that makes most sense is to signal an error when
> compiling the regexp.
Clearly, but some behaviour needs to be preserved for compatibility.
Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sat, 17 Jun 2023 22:19:02 GMT)
Message #14 received at 64128 <at> debbugs.gnu.org (full text, mbox):
On 2023-06-17 13:07, Mattias Engdegård wrote:
> 17 juni 2023 kl. 20.44 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
>
>> I think the behavior that makes most sense is to signal an error when
>> compiling the regexp.
>
> Clearly, but some behaviour needs to be preserved for compatibility.
> Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
>
doc/lispref/searching.texi says that "*" is treated as an ordinary
character if it is in a context where its special meaning makes no
sense, giving "*foo" as an example. If we break with this tradition by
making "\b*" an error instead of being equivalent to "\b\*", we should
update that part of the manual.
One possible way forward is to update doc/lispref/searching.texi to
specify what we want. Then we can modify the code to match the updated
documentation.
In my experience, modifying the doc is often the hard part, so I took a
crack at that in the draft proposed patch, which I have not installed.
Comments?
[0001-Document-that-b-etc-are-now-invalid-regexps.patch (text/x-patch, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sun, 18 Jun 2023 04:56:02 GMT)
Message #17 received at 64128 <at> debbugs.gnu.org (full text, mbox):
> Cc: 64128 <at> debbugs.gnu.org
> Date: Sat, 17 Jun 2023 15:18:00 -0700
> From: Paul Eggert <eggert <at> cs.ucla.edu>
>
> > Clearly, but some behaviour needs to be preserved for compatibility.
> > Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> >
>
> doc/lispref/searching.texi says that "*" is treated as an ordinary
> character if it is in a context where its special meaning makes no
> sense, giving "*foo" as an example. If we break with this tradition by
> making "\b*" an error instead of being equivalent to "\b\*", we should
> update that part of the manual.
>
> One possible way forward is to update doc/lispref/searching.texi to
> specify what we want. Then we can modify the code to match the updated
> documentation.
>
> In my experience, modifying the doc is often the hard part, so I took a
> crack at that in the draft proposed patch, which I have not installed.
>
> Comments?
My comment is that since this was a documented feature, I'm not
interested in making it an error.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Sun, 18 Jun 2023 20:27:01 GMT)
Message #20 received at 64128 <at> debbugs.gnu.org (full text, mbox):
18 juni 2023 kl. 06.55 skrev Eli Zaretskii <eliz <at> gnu.org>:
> My comment is that since this was a documented feature, I'm not
> interested in making it an error.
Yes, it would be unwise to raise an error for "^*" or the like; it's in active use.
The manual is a bit hazy about what we actually promise, though.
As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?
To make everything clear, we have two groups of zero-width assertions:
Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=
Group B assertions work like ordinary elements, syntactically and semantically. Simple, predictable, but also useless.
Group A assertions are more interesting: either there is nothing before a train of such assertions, such as
"^\\`\\b\\`*?"
which turns the first character of the operator into a literal (and a second character, if present, now becomes an operator acting on that literal).
Or there is something before them, in which case the operator acts on the last element preceding the assertions, except that multiple literal characters coalesce into a single element. A further exception: if one of the literal chars is an out-of-place `^`, it splits the sequence of literals into separate segments, though not exactly where you would expect.
For example,
"abc^def\\B\\B+?"
means, I think,
(seq "ab" (+? "c^def" not-word-boundary not-word-boundary))
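One way to test a tentative reading like the one above is to compare the quirky regexp with its rx transcription on a few inputs. This is a rough sketch under assumptions: `my-match-span` is a hypothetical helper, and the sample strings are arbitrary. The two columns agree only if the running Emacs parses the regexp as conjectured, which depends on the Emacs version.

```elisp
(require 'rx)

;; Hypothetical helper, not part of Emacs.
(defun my-match-span (regexp string)
  "Return (START . END) of the first match of REGEXP in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (cons (match-beginning 0) (match-end 0)))))

(let ((quirky "abc^def\\B\\B+?")
      ;; The conjectured parse, written out explicitly in rx syntax.
      (reading (rx (seq "ab"
                        (+? "c^def" not-word-boundary not-word-boundary)))))
  (dolist (s '("abc^def" "abc^defc^def" "abc^def+" "ab"))
    (message "%-16S quirky=%S reading=%S"
             s (my-match-span quirky s) (my-match-span reading s))))
```

Any line where the two spans differ is evidence against the conjectured parse for that Emacs build.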
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 03:05:01 GMT)
Message #23 received at 64128 <at> debbugs.gnu.org (full text, mbox):
> To make everything clear, we have to groups of zero-width assertions:
>
> Group A: ^ $ \` \' \b \B
IIRC `^` is only special if it's at the beginning of a group, so `^*` will
always treat this * as a literal, right?
"Similarly" `$` is only special if it's at the end of a group, so `$*` will
always be a repetition of the $ character no?
So the remaining problematic elements are \` \' \b and \B
I suspect if we don't want to signal errors, the next best thing is to
treat them like group B.
Stefan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 08:45:02 GMT)
Message #26 received at 64128 <at> debbugs.gnu.org (full text, mbox):
19 juni 2023 kl. 05.04 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> `^` is only special if it's at the beginning of a group, so `^*` will
> always treat this * as a literal, right?
> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
> always be a repetition of the $ character no?
Yes, ^ and $ have additional rules for when they are plain literals and not subject to these bugs at all.
The literal-splitting powers of ^ have now (075e77ac44) been removed.
> So the remaining problematic elements are \` \' \b and \B
\`* has been observed, so we probably need to keep that working as well.
> I suspect if we don't want to signal errors, the next best thing is to
> treat them like group B.
Yes, maybe; they are less likely to be followed by an operator-literal, but it would also be good to have all zero-width assertions work the same way.
On the other hand, it can't be worse than we have now, as long as we get rid of the "quack,\\b*" semantics.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 12:55:02 GMT)
Message #29 received at 64128 <at> debbugs.gnu.org (full text, mbox):
I wish there was a way to emit warnings about oddball constructs
(starting with the "* is literal when encountered at the beginning of
a regexp").
Stefan
Mattias Engdegård [2023-06-19 10:44:04] wrote:
> 19 juni 2023 kl. 05.04 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
>
>> `^` is only special if it's at the beginning of a group, so `^*` will
>> always treat this * as a literal, right?
>> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
>> always be a repetition of the $ character no?
>
> Yes, ^ and $ have additional rules for when they are plain literals and not
> subject to these bugs at all.
>
> The literal-splitting powers of ^ have now (075e77ac44) been removed.
>
>> So the remaining problematic elements are \` \' \b and \B
>
> \`* has been observed, so we probably need to keep that working as well.
>
>> I suspect if we don't want to signal errors, the next best thing is to
>> treat them like group B.
>
> Yes, maybe; they are less likely to be followed by an operator-literal, but
> it would also be good to have all zero-width assertions work the same way.
> On the other hand, it can't be worse than we have now, as long as we get rid
> of the "quack,\\b*" semantics.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 18:15:02 GMT)
Message #32 received at 64128 <at> debbugs.gnu.org (full text, mbox):
On 2023-06-18 13:26, Mattias Engdegård wrote:
> The manual is a bit hazy about what we actually promise, though.
>
> As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?
Although it's not easy to document, we should do better. I gave that a
shot by installing the attached patches into the master branch. These
patches try to document current behavior, including warning about the
squirrelly behavior you mention. If/when we fix the squirrelly behavior
we can change that part of the manual accordingly.
The last of the three patches is merely a terminology change: it
standardizes on the term "bracket expression" for regexps like [a-z].
Formerly the doc and comments were inconsistent about the terminology.
It's better to stick with the POSIX term here, to avoid confusion. I
myself got confused about this when editing the other two patches.
Comments welcome as usual.
[0001-Document-regular-expression-special-cases-better.patch (text/x-patch, attachment)]
[0002-Document-Emacs-vs-POSIX-REs.patch (text/x-patch, attachment)]
[0003-Call-them-bracket-expressions-more-consistently.patch (text/x-patch, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 18:35:01 GMT)
Message #35 received at 64128 <at> debbugs.gnu.org (full text, mbox):
19 juni 2023 kl. 14.54 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
>
> I wish there was a way to emit warnings about oddball constructs
> (starting with the "* is literal when encountered at the beginning of
> a regexp").
I agree, but I'm more of a static analysis man. (And relint does complain about all these cases as long as the regexp is detected as such, so there probably aren't many of them left in the Emacs tree.)
Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
Maybe we should change group B assertions so that they work in the same way.
[regexp-zero-width-assertion-noquack.diff (application/octet-stream, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 19:22:01 GMT)
Message #38 received at 64128 <at> debbugs.gnu.org (full text, mbox):
On 2023-06-19 11:34, Mattias Engdegård wrote:
> Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
> Maybe we should change group B assertions so that they work in the same way.
> - operand. Reset at the beginning of groups and alternatives. */
> + operand. Reset at the beginning of groups and alternatives,
> + and after zero-width assertions which should not be the target
> + of any postfix repetition operators. */
If I understand things correctly, this would cause "\b*c" to be treated
like "\b\*c". If so, it's headed in the wrong direction.
It's long been documented that the only reason "*" is ordinary at the
start of a regular expression or subexpression is "historical
compatibility", and it's also long been documented that you shouldn't
take advantage of this and you should backslash-escape the "*" anyway.
In contrast, for constructs like \b* there is not a historical
compatibility reason, so there's not a good argument for treating "*" as
an ordinary character after "\b".
Instead, \b should not be a special case before "*", and \b* should be
equivalent to \(\b\)* and should match only the empty string. Similarly
for the other zero-width backslash escapes. This is what I would expect
from these constructs from the longstanding documentation.
If we instead added a rule to say that a construct that can only match
the empty string causes following "*" to be ordinary, then \b* and \(\b\)*
would both be equivalent to \*. Although consistent, this would be
confusing: it would compound the historical-compatibility mistake. Let's
keep things simple instead.
Also, whatever change we make to the behavior should be documented in
the manual and in etc/NEWS.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 19:53:01 GMT)
Message #41 received at 64128 <at> debbugs.gnu.org (full text, mbox):
19 juni 2023 kl. 21.21 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".
Actually it already works that way. What the patch does is prevent AB\b*C from being treated as \(?:AB\b\)*C; it is treated as AB\b\*C instead, which I think we can all agree is less wrong.
You can check the test cases in the patch:
(should (equal (string-match "q\\b*!" "q*!") 0))
(should (equal (string-match "q\\b*!" "!") nil))
which in current Emacs produce 2 and 0 respectively.
> It's long been documented that the only reason "*" is ordinary at the start of a regular expression or subexpression is "historical compatibility", and it's also long been documented that you shouldn't take advantage of this and you should backslash-escape the "*" anyway. In contrast, for constructs like \b* there is not a historical compatibility reason, so there's not a good argument for treating "*" as an ordinary character after "\b".
Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.
We also have \` to consider -- I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.
> Instead, \b should not be a special case before "*", and \b* should be equivalent to \(\b\)* and should match only the empty string. Similarly for the other zero-width backslash escapes. This is what I would expect from these constructs from the longstanding documentation.
>
> If we instead added a rule to say that a construct that can only match the empty string causes following "*" to ordinary, then \b* and \(\b\)* would both be equivalent to \*. Although consistent, this would be confusing: it would compound the historical-compatibility mistake. Let's keep things simple instead.
Yes, I definitely would be confused by such semantics.
> Also, whatever change we make to the behavior should be documented in the manual and in etc/NEWS.
Will be happy to oblige, although in this case it really just was a bug fix.
What I really would like to see is the regexp parser somehow separated from the NFA bytecode generator, which would make both clearer. The parser could then be re-used for other purposes such as a different back-end (DFA construction) or a built-in xr-like converter.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 20:10:01 GMT)
Message #44 received at 64128 <at> debbugs.gnu.org (full text, mbox):
>> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".
> Actually it already works that way. What the patch does, is preventing
> AB\b*C from being treated as \(?:AB\b\)*C but as AB\b\*C instead, which
> I think we can all agree is less wrong.
Hmm... maybe it's less wrong, but I'd rather make it behave like
AB\(\b\)*C, which is, I'd argue, even less wrong.
Or maybe make it signal an error: I can't imagine that the current
behavior is used by very much code at all, seeing how it's so
seriously non-intuitive.
Stefan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Mon, 19 Jun 2023 20:41:01 GMT)
Message #47 received at 64128 <at> debbugs.gnu.org (full text, mbox):
On 2023-06-19 12:52, Mattias Engdegård wrote:
> Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.
OK, but we still need to fix this, as \b and \B should not be a special
case for following "*".
> I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.
That makes some sense, in that \` is like ^, and ^ is already a special
case (this is true even in POSIX BREs).
In other words, how about if we change the groups from your list:
Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=
to this:
Group A: ^ \`
Group B: $ \' \b \B \< \> \_< \_> \=
where "*" is ordinary after Group A, and special after Group B and there
is no other squirrelly behavior. And similarly for the other repetition
operators.
Attached is a proposed doc change for this, which I have not installed.
Of course the code and etc/NEWS would need changing too.
[0001-Document-proposed-regex-fix-bug-64128.patch (text/x-patch, attachment)]
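The regrouping proposed in this message can be written down as a table of equivalences and checked against a given Emacs build. This is a sketch under stated assumptions: the variable and function names are hypothetical, and the right-hand sides are one interpretation of "ordinary after Group A, special after Group B". An Emacs that predates the change is expected to report differences for some entries.

```elisp
(require 'seq)

;; Hypothetical table: each entry pairs a regexp with the explicit form it
;; would be equivalent to under the regrouping proposed above.
(defvar my-proposed-equivalences
  '(("^*"     . "^\\*")            ; Group A: the operator becomes literal
    ("\\`*"   . "\\`\\*")
    ("\\b*"   . "\\(?:\\b\\)*")    ; Group B: the operator applies to the assertion
    ("\\<+"   . "\\(?:\\<\\)+")
    ("xy\\B*" . "xy\\(?:\\B\\)*")))

(defun my-check-equivalences (string)
  "Return the entries whose two regexps match STRING differently."
  (seq-remove
   (lambda (pair)
     (equal (and (string-match (car pair) string) (match-data))
            (and (string-match (cdr pair) string) (match-data))))
   my-proposed-equivalences))

;; An empty list means every pair agreed on this input.
(message "%S" (my-check-equivalences "xy*"))
```

Agreement on a handful of inputs does not prove that two regexps are equivalent; it only helps spot cases where the parser clearly deviates from the proposed rule.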
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Tue, 20 Jun 2023 11:37:02 GMT)
Message #50 received at 64128 <at> debbugs.gnu.org (full text, mbox):
19 juni 2023 kl. 22.08 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> Hmm... maybe it's less wrong, but I'd rather make it behave like
> AB\(\b\)*C, which is, I'd argue, even less wrong.
I agree, and you are probably right that it's safe to do that.
> Or maybe make it signal an error: I can't imagine that the current
> behavior is used by very much code at all, seeing how it's so
> seriously non-intuitive.
That might be even better if we can get away with it.
19 juni 2023 kl. 22.40 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
> In other words, how about if we change the groups from your list:
>
> Group A: ^ $ \` \' \b \B
> Group B: \< \> \_< \_> \=
>
> to this:
>
> Group A: ^ \`
> Group B: $ \' \b \B \< \> \_< \_> \=
>
> where "*" is ordinary after Group A, and special after Group B and there is no other squirrelly behavior. And similarly for the other repetition operators.
Sounds fine, with the option to go full error on group B if we agree that that's even better.
> Attached is a proposed doc change for this, which I have not installed.
Thank you, it has been incorporated in the attached patch which follows your suggestions above.
Your previous regexp doc updates are most appreciated. I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.
[0001-Straighten-regexp-postfix-operator-after-zero-width-.patch (application/octet-stream, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Wed, 21 Jun 2023 06:09:02 GMT)
Message #53 received at 64128 <at> debbugs.gnu.org (full text, mbox):
On 2023-06-20 04:36, Mattias Engdegård wrote:
> Sounds fine, with the option to go full error on group B if we agree that that's even better.
That would be fine too. I'd even prefer it. In the meantime your patch
looks good.
> I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.
Agreed.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64128; Package emacs. (Wed, 21 Jun 2023 15:58:02 GMT)
Message #56 received at 64128 <at> debbugs.gnu.org (full text, mbox):
21 juni 2023 kl. 08.08 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
>> Sounds fine, with the option to go full error on group B if we agree that that's even better.
>
> That would be fine too. I'd even prefer it. In the meantime your patch looks good.
Good, it's now in master. Let's think about whether an error can be motivated, and how.
We usually don't prevent the user from doing silly things, except when there is a strong reason to believe that it might be a serious mistake.
This bug report was last modified 1 year and 159 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.