GNU bug report logs - #34641
rx: (or ...) order unpredictable

Previous Next

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Sun, 24 Feb 2019 18:41:02 UTC

Severity: normal

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 34641 in the body.
You can then email your comments to 34641 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sun, 24 Feb 2019 18:41:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 24 Feb 2019 18:41:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: rx: (or ...) order unpredictable
Date: Sun, 24 Feb 2019 19:40:33 +0100
The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example,

(rx (or "ab" "a") (or "a" "ab"))
=>
"\\(?:ab?\\)\\(?:ab?\\)"

The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent. Not having this control makes rx useless or dangerous for many purposes.

The reason for the reordering is the use of regex-opt behind the scenes. Whether rx is the place to do this kind of optimisation is a matter of opinion; mine is that it belongs in the regexp engine, together with other, more aggressive optimisations (DFA, native-code generation, etc) could be performed as well.

We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution.

(Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sun, 24 Feb 2019 19:07:01 GMT) Full text and rfc822 format available.

Message #8 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 24 Feb 2019 21:06:34 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 24 Feb 2019 19:40:33 +0100
> 
> We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution.

Your preferred solution is fine with me, FWIW.

> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)

Fix it, I think.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sun, 24 Feb 2019 21:19:01 GMT) Full text and rfc822 format available.

Message #11 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 24 Feb 2019 22:18:28 +0100
[Message part 1 (text/plain, inline)]
24 feb. 2019 kl. 20.06 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> Your preferred solution is fine with me, FWIW.

Thank you; patch attached.

>> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)
> 
> Fix it, I think.

I'll prepare another patch. Is there a preferred or particularly clever never-matching regexp? If not, would \(?:$\)A do?
[0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sun, 24 Feb 2019 22:46:02 GMT) Full text and rfc822 format available.

Message #14 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: "Basil L. Contovounesios" <contovob <at> tcd.ie>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 24 Feb 2019 22:44:57 +0000
Mattias Engdegård <mattiase <at> acm.org> writes:

> Is there a preferred or particularly clever never-matching regexp?
> If not, would \(?:$\)A do?

FWIW, CC Mode has used "a\\`" since the following discussion:

https://lists.gnu.org/archive/html/emacs-devel/2018-03/msg00876.html

Stefan also suggested to make a variable out of this, but I don't think
anything came of that:

https://lists.gnu.org/archive/html/emacs-devel/2018-04/msg00047.html

-- 
Basil




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Mon, 25 Feb 2019 02:39:01 GMT) Full text and rfc822 format available.

Message #17 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Noam Postavsky <npostavs <at> gmail.com>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 24 Feb 2019 21:37:56 -0500
On Sun, 24 Feb 2019 at 13:41, Mattias Engdegård <mattiase <at> acm.org> wrote:
>
> The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example,
>
> (rx (or "ab" "a") (or "a" "ab"))
> =>
> "\\(?:ab?\\)\\(?:ab?\\)"
>
> The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent.

I don't see the problem, isn't "ab?" semantically equivalent to
"ab\\|a" (and "a\\|ab")?

> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.

This sounds familiar, though I can't locate a report for it.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Mon, 25 Feb 2019 09:57:02 GMT) Full text and rfc822 format available.

Message #20 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Noam Postavsky <npostavs <at> gmail.com>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Mon, 25 Feb 2019 10:56:44 +0100
25 feb. 2019 kl. 03.37 skrev Noam Postavsky <npostavs <at> gmail.com>:
> 
> I don't see the problem, isn't "ab?" semantically equivalent to
> "ab\\|a" (and "a\\|ab")?

Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used.

Suppose we are matching against the string "abc". Then
ab\|a matches "ab"
a\|ab matches "a"
ab?   matches "ab"
ab??  matches "a" (non-greedy operator)

(I remember writing, young and foolish, [0-9]+\|0[xX][0-9a-fA-F]+ to match a number in decimal or hex, and was surprised that all hex numbers were zero.)

>> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.
> 
> This sounds familiar, though I can't locate a report for it.

If you do remember, please tell us about it.
The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Mon, 25 Feb 2019 14:27:01 GMT) Full text and rfc822 format available.

Message #23 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: "Basil L. Contovounesios" <contovob <at> tcd.ie>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Mon, 25 Feb 2019 15:26:18 +0100
[Message part 1 (text/plain, inline)]
24 feb. 2019 kl. 23.44 skrev Basil L. Contovounesios <contovob <at> tcd.ie>:
> 
> FWIW, CC Mode has used "a\\`" since the following discussion:

Thank you, I'll use that then.
Here is a patch (to be applied after the other one).
[0001-Correct-regexp-opt-return-value-for-empty-string-lis.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Mon, 25 Feb 2019 14:44:02 GMT) Full text and rfc822 format available.

Message #26 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Noam Postavsky <npostavs <at> gmail.com>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Mon, 25 Feb 2019 09:43:07 -0500
On Mon, 25 Feb 2019 at 04:56, Mattias Engdegård <mattiase <at> acm.org> wrote:

> Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used.
>
> Suppose we are matching against the string "abc". Then
> ab\|a matches "ab"
> a\|ab matches "a"

Oh, huh. So it does. I guess I've never used regexp in a situation
where this subtle corner case would come up.

> >> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.
> >
> > This sounds familiar, though I can't locate a report for it.
>
> If you do remember, please tell us about it.
> The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps.

Right, found it this time, it's Bug#20307.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Mon, 25 Feb 2019 14:49:02 GMT) Full text and rfc822 format available.

Message #29 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Noam Postavsky <npostavs <at> gmail.com>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Mon, 25 Feb 2019 15:48:31 +0100
25 feb. 2019 kl. 15.43 skrev Noam Postavsky <npostavs <at> gmail.com>:
> 
> Right, found it this time, it's Bug#20307.

Excellent! I'll move this part to that bug then, and update the patch to use that bug number.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sat, 02 Mar 2019 12:34:01 GMT) Full text and rfc822 format available.

Message #32 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 02 Mar 2019 14:33:05 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 24 Feb 2019 22:18:28 +0100
> Cc: 34641 <at> debbugs.gnu.org
> 
> > Your preferred solution is fine with me, FWIW.
> 
> Thank you; patch attached.

Thanks, LGTM with minor comments below.

> +The optional argument @var{noreorder}, if @code{nil}, allows the
> +returned regexp to match the strings in any order.  If non-@code{nil},
> +the regexp is equivalent to a chain of alternatives (by the @samp{\|}
> +operator) of the strings in the order given.

I find the text in NEWS much more clear regarding what happens when
the new arg is non-nil.  I think what is missing is a more explicit
description you have in NEWS:

  If the new third argument is non-nil, the match is
  guaranteed to be performed in the order given, as if the strings were
  made into a regexp by joining them with '\|'.

So I suggest to mention explicitly the "match is guaranteed to be
performed in the order given" part.

> +The optional argument NOREORDER, if nil, allows the returned

We usually say "if nil or omitted" for optional arguments.

> +(defun regexp-opt--contains-prefix (strings)
> +  "Whether a list of strings contains a proper prefix of one of its elements.
> +STRINGS must be sorted and free from duplicates."

It is usually a good idea to refer to arguments explicitly in the
first sentence of a doc string.  In this case, I think just up-casing
STRINGS there should be enough.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sat, 02 Mar 2019 14:06:01 GMT) Full text and rfc822 format available.

Message #35 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 2 Mar 2019 15:05:08 +0100
2 mars 2019 kl. 13.33 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> I find the text in NEWS much more clear regarding what happens when
> the new arg is non-nil.  I think what is missing is a more explicit
> description you have in NEWS:
> 
>  If the new third argument is non-nil, the match is
>  guaranteed to be performed in the order given, as if the strings were
>  made into a regexp by joining them with '\|'.
> 
> So I suggest to mention explicitly the "match is guaranteed to be
> performed in the order given" part.

Right; rephrased the doc string and searching.texi.

>> +The optional argument NOREORDER, if nil, allows the returned
> 
> We usually say "if nil or omitted" for optional arguments.

Understood; used in both places.

>> +(defun regexp-opt--contains-prefix (strings)
>> +  "Whether a list of strings contains a proper prefix of one of its elements.
>> +STRINGS must be sorted and free from duplicates."
> 
> It is usually a good idea to refer to arguments explicitly in the
> first sentence of a doc string.  In this case, I think just up-casing
> STRINGS there should be enough.

Rephrased.

Thanks for the review; revised patch attached.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sat, 02 Mar 2019 14:09:02 GMT) Full text and rfc822 format available.

Message #38 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 2 Mar 2019 15:08:26 +0100
[Message part 1 (text/plain, inline)]
2 mars 2019 kl. 15.05 skrev Mattias Engdegård <mattiase <at> acm.org>:
> 
> Thanks for the review; revised patch attached.

Sorry, here it is.

[0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sat, 02 Mar 2019 14:24:02 GMT) Full text and rfc822 format available.

Message #41 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 02 Mar 2019 16:23:08 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 2 Mar 2019 15:08:26 +0100
> Cc: 34641 <at> debbugs.gnu.org
> 
> > Thanks for the review; revised patch attached.
> 
> Sorry, here it is.

LGTM, thanks.




Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Sat, 02 Mar 2019 14:38:02 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Sat, 02 Mar 2019 14:38:02 GMT) Full text and rfc822 format available.

Message #46 received at 34641-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34641-done <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 2 Mar 2019 15:37:26 +0100
2 mars 2019 kl. 15.23 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> LGTM, thanks.

Thank you, pushed.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sat, 02 Mar 2019 23:49:01 GMT) Full text and rfc822 format available.

Message #49 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Phil Sainty <psainty <at> orcon.net.nz>
To: Eli Zaretskii <eliz <at> gnu.org>, Mattias Engdegård
 <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 3 Mar 2019 12:48:33 +1300
>> +The optional argument NOREORDER, if nil, allows the returned

Could we change that name to have a positive sense?

Boolean arguments with a negative sense/meaning are invariably
more awkward to read than ones with a positive meaning, IMO.

NOREORDER set to nil means "No NOREORDER" (aka "Reorder").  Those
double-negatives should be avoided whenever it's simple do do so,
as they make things harder for anyone reading the documentation.

We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything
along those lines, and then a 'true' value matches the positive
sense of the name, which is much nicer.


-Phil




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Sun, 03 Mar 2019 08:55:02 GMT) Full text and rfc822 format available.

Message #52 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Phil Sainty <psainty <at> orcon.net.nz>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Sun, 3 Mar 2019 09:54:13 +0100
3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty <at> orcon.net.nz>:
> 
>>> +The optional argument NOREORDER, if nil, allows the returned
> 
> Could we change that name to have a positive sense?
> 
> Boolean arguments with a negative sense/meaning are invariably
> more awkward to read than ones with a positive meaning, IMO.
> 
> NOREORDER set to nil means "No NOREORDER" (aka "Reorder").  Those
> double-negatives should be avoided whenever it's simple do do so,
> as they make things harder for anyone reading the documentation.
> 
> We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything
> along those lines, and then a 'true' value matches the positive
> sense of the name, which is much nicer.

You are right, and I wasn't happy with the negative name either.
Would KEEP-ORDER do?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34641; Package emacs. (Thu, 07 Mar 2019 09:01:02 GMT) Full text and rfc822 format available.

Message #55 received at 34641 <at> debbugs.gnu.org (full text, mbox):

From: Phil Sainty <psainty <at> orcon.net.nz>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34641 <at> debbugs.gnu.org
Subject: Re: bug#34641: rx: (or ...) order unpredictable
Date: Thu, 7 Mar 2019 22:00:27 +1300
On 3/03/19 9:54 PM, Mattias Engdegård wrote:
> 3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty <at> orcon.net.nz>:
>>>> +The optional argument NOREORDER, if nil, allows the returned
>> Could we change that name to have a positive sense?
> 
> You are right, and I wasn't happy with the negative name either.
> Would KEEP-ORDER do?

I think that's good, and no one has objected, so I think you could
go ahead with that change.

cheers,
-Phil




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 04 Apr 2019 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 22 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.