GNU bug report logs - #37659
rx additions: anychar, unmatchable, unordered-or

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Tue, 8 Oct 2019 09:37:01 UTC

Severity: wishlist

Tags: fixed, patch

Fixed in version 27.1

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37659 in the body.
You can then email your comments to 37659 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 08 Oct 2019 09:37:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 08 Oct 2019 09:37:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 8 Oct 2019 11:36:44 +0200
[Message part 1 (text/plain, inline)]
Three minor rx additions follow:

* Add `anychar' as an alias for `anything': the latter suggests an expression that can match any string, while in reality it only matches a single character. The documentation now uses `anychar' as the preferred name. (`any-char' would also be possible, but is longer.)

* Add `unmatchable' for a never-match regexp. This follows the previously introduced variable `regexp-unmatchable'.

* Add `unordered-or' as a variant of `or' without the left-to-right match order guarantee. It allows unconditional regexp-opt optimisations, and is particularly useful for matching sets of keywords. With rx-let and rx-define, it also has the potential for better compositionality, allowing expressions to be put together from smaller parts.

Abstractly: while `or' is associative, `unordered-or' is also commutative.

The name `unordered-or' is descriptive but phonetically (and lexically) somewhat weak. Strong alternatives welcome.
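
For illustration, here is a minimal sketch of how the first two additions behave, assuming Emacs 27.1 or later (where they were eventually released). The variable name `my-one-char-re' is hypothetical and not part of the patches.

```elisp
;; Sketch assuming Emacs 27.1 or later, where `anychar' and
;; `unmatchable' are available in rx.

;; `anychar' matches exactly one character, newline included.
(defvar my-one-char-re (rx string-start anychar string-end))

(string-match-p my-one-char-re "\n")   ; matches at position 0
(string-match-p my-one-char-re "ab")   ; nil: two characters, not one

;; `unmatchable' never matches, whatever the input.
(string-match-p (rx unmatchable) "any text at all")  ; nil
```

The single-character behaviour is the reason given above for preferring the name `anychar' over `anything'.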

[0001-Add-anychar-as-alias-to-anything-in-rx.patch (application/octet-stream, attachment)]
[0002-Add-unmatchable-as-alias-for-or-in-rx.patch (application/octet-stream, attachment)]
[0003-Add-rx-unordered-or-construct.patch (application/octet-stream, attachment)]

Added tag(s) patch. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Tue, 08 Oct 2019 10:25:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Wed, 09 Oct 2019 09:00:02 GMT) Full text and rfc822 format available.

Message #10 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: 37659 <at> debbugs.gnu.org
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 9 Oct 2019 10:59:43 +0200
[Message part 1 (text/plain, inline)]
Also consider changing the rendition of anychar/anything from ".\\|\n" to "[^z-a]", which is faster and does not allocate stack space. Previously, (* anything) wouldn't match large strings.
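
A rough sketch of the two renditions side by side may help. The names prefixed with `my-' are hypothetical, and the string length is an arbitrary choice for illustration rather than a measured limit.

```elisp
;; "[^z-a]" is the complement of an empty range (z sorts after a),
;; so it matches any single character, newline included.
(defvar my-any-char-old "\\(?:.\\|\n\\)")  ; earlier rendition
(defvar my-any-char-new "[^z-a]")          ; proposed rendition

;; Both accept a newline as the single character.
(string-match-p (concat "\\`" my-any-char-old "\\'") "\n")
(string-match-p (concat "\\`" my-any-char-new "\\'") "\n")

;; A long multi-line string to repeat the character-class form over.
(defvar my-long-text
  (concat (make-string 50000 ?x) "\n" (make-string 50000 ?y)))

(defun my-match-length (regexp string)
  "Return the length of the first match of REGEXP in STRING, or nil."
  (save-match-data
    (when (string-match regexp string)
      (- (match-end 0) (match-beginning 0)))))

;; The whole text, newline and all, is consumed by the starred class.
(my-match-length (concat my-any-char-new "*") my-long-text)
```

The alternation form is deliberately not repeated over the long string here, since that is the case reported above as failing.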

[0004-Use-z-a-for-matching-any-character-anychar-anything-.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Fri, 11 Oct 2019 23:08:02 GMT) Full text and rfc822 format available.

Message #13 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: Mattias Engdegård <mattiase <at> acm.org>
Date: Fri, 11 Oct 2019 16:07:25 -0700
Thanks for the proposed patch. Two thoughts:

1. Instead of the symbol 'unordered-or' (which is remarkably hard to 
read), I suggest using the ASCII letter 'V'. This ASCIIfies the Unicode 
symbol U+2228 LOGICAL OR (∨). If you prefer, you could make the Unicode 
symbol an alias for 'V', or use lower-case ASCII 'v', or whatever. The 
point is that '(unordered-or A B)' is too hard to read with all those 
'or's in there.

2. Re this patch:

> -    ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
> +    ((or 'anychar 'anything)      (cons (list "[^z-a]") t))

Is there a reason this uses (cons (list "[^z-a]") t) rather than 
'(("[^z-a]") . t) ? I realize neighboring code does something similar, 
but it's not clear to me why it's important to construct new objects 
here instead of using literals.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Sat, 12 Oct 2019 10:48:02 GMT) Full text and rfc822 format available.

Message #16 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: Mattias Engdegård <mattiase <at> acm.org>
Date: Sat, 12 Oct 2019 12:47:13 +0200
12 okt. 2019 kl. 01.07 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
> 
> 1. Instead of the symbol 'unordered-or' (which is remarkably hard to read), I suggest using the ASCII letter 'V'. This ASCIIfies the Unicode symbol U+2228 LOGICAL OR (∨). If you prefer, you could make the Unicode symbol an alias for 'V', or use lower-case ASCII 'v', or whatever. The point is that '(unordered-or A B)' is too hard to read with all those 'or's in there.

Definitely agree on the imperfections of 'unordered-or', and while I'd be the first to welcome more use of Unicode symbols, I'm not sure V (or v, or ∨) are very descriptive --- even if an alert reader intuits the rebus of 'V' (perhaps via \vee in TeX), there is no hint of the difference from 'or' or '|'.

Other suggestions:

'or*' --- follows the Lisp tradition of appending a star to get a variant and informs the reader that it's like 'or' but with a twist. The downside is that it might suggest a Kleene closure somehow.

'either', 'one-of', 'choose', 'pick-one', 'alternative', 'alt' --- very readable although the relationship to 'or' isn't quite clear. Perhaps they suggest a looser sense of ordering?

'unseq-or' --- a bit more readable and phonetically sharper than 'unordered-or', but it suggests a relation to 'seq'.

'nonstrict-or' --- abuses the familiar programming notion of strictness?

'or-ooo' --- will mostly make sense to the comp-arch crowd.

> Is there a reason this uses (cons (list "[^z-a]") t) rather than '(("[^z-a]") . t) ? I realize neighboring code does something similar, but it's not clear to me why it's important to construct new objects here instead of using literals.

Yes, there is a comment right above explaining that the returned value may be mutated (at least one use of mapcan). I tried doing it the other way, but neither was clearly better than the other (in performance or style), so I've let it stand for now. Nothing I feel strongly about either way.
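
The concern about mutation can be sketched as follows. This is an illustration only, not the code in rx.el: `my-fresh-item' and `my-join-items' are hypothetical helpers that mimic a translator returning a (LIST . FLAG) pair and a caller joining the list parts with `mapcan'.

```elisp
(defun my-fresh-item ()
  "Return a newly allocated (LIST . FLAG) pair, safe for callers to mutate."
  (cons (list "[^z-a]") t))

(defun my-join-items (items)
  "Concatenate the list parts of ITEMS destructively, as `mapcan' does."
  (mapcan #'car items))

;; Each call builds fresh conses, so destructive joining is harmless
;; and repeated evaluation gives the same answer.
(my-join-items (list (my-fresh-item) (my-fresh-item)))
;; => ("[^z-a]" "[^z-a]")
```

Had `my-fresh-item' returned the quoted constant '(("[^z-a]") . t) instead, every call would hand out the same conses, and the destructive join above would splice that shared list onto itself.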





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Sun, 13 Oct 2019 16:53:01 GMT) Full text and rfc822 format available.

Message #19 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: Mattias Engdegård <mattiase <at> acm.org>
Date: Sun, 13 Oct 2019 09:52:36 -0700
On 10/12/19 3:47 AM, Mattias Engdegård wrote:

> there is no hint of the difference from 'or' or '|'.

That goal is secondary and can be dispensed with. It is OK to use a symbol that 
one must remember or look up. 'unordered-or' is simply too ungainly.

Of the names you suggest, 'alt' is the best. But here's another idea: use
'|' for unordered or, and 'or' for ORdered OR. Strictly speaking this would be 
an incompatible change, but I doubt whether many users will notice or care.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Sun, 13 Oct 2019 19:49:02 GMT) Full text and rfc822 format available.

Message #22 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: Mattias Engdegård <mattiase <at> acm.org>
Date: Sun, 13 Oct 2019 21:48:25 +0200
13 okt. 2019 kl. 18.52 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
> 
> Of the names you suggest, 'alt' is the best. But here's another idea: use '|' for unordered or, and 'or' for ORdered OR. Strictly speaking this would be an incompatible change, but I doubt whether many users will notice or care.

That's an interesting notion, but I am wary of the incompatibility; I'll have to think about it. Until then, let's consider 'alt' the default name. More (informed) opinions on the subject sought!

I take it the other changes --- 'anychar', [^z-a], and 'unmatchable' --- are less debatable; I'll push them in a day or two if nobody objects.

Thanks for taking an interest in naming, by the way; it isn't bikeshedding and deserves care.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 22 Oct 2019 15:15:02 GMT) Full text and rfc822 format available.

Message #25 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or 
Date: Tue, 22 Oct 2019 17:14:08 +0200
'regexp-opt' always generates a regexp preferring long matches. This is undocumented, but useful enough that I would be surprised if this property wasn't exploited (perhaps unknowingly) by callers. It's quite natural: given a set of strings, surely the caller wants them all to be candidates for a match, even if there is no following anchoring pattern.

Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments. Thus, the rx user gets all the benefits from 'regexp-opt' in a composable way, without a need to sort the strings or otherwise prepare them.

(The old 'or' behaviour always used 'regexp-opt' when possible, which was very fragile: (or "a" "ab") would match "ab", but (or "a" "ab" digit) would just match "a". 'or-max' is robust, without surprises.)

Of course, we should also guarantee the maximum-matching property of regexp-opt. This is just a matter of documentation (and test); it does not restrict optimisations as far as I can tell.
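
The difference between left-to-right alternation and the long-match preference of `regexp-opt' can be seen with a small sketch. The helper `my-matched-text' is hypothetical and exists only to show which text was matched.

```elisp
(defun my-matched-text (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (save-match-data
    (when (string-match regexp string)
      (match-string 0 string))))

;; Plain alternation tries branches left to right: "a" wins.
(my-matched-text "a\\|ab" "abc")                   ; => "a"

;; `regexp-opt' factors out the common prefix, so the longer "ab" wins.
(my-matched-text (regexp-opt '("a" "ab")) "abc")   ; => "ab"
```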

Again, I'm open to suggestions about a better name than 'or-max'.

The other patches (anychar, unmatchable, and [^z-a]) have been pushed to master.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 22 Oct 2019 15:28:01 GMT) Full text and rfc822 format available.

Message #28 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 22 Oct 2019 17:27:48 +0200
>>>>> On Tue, 22 Oct 2019 17:14:08 +0200, Mattias Engdegård <mattiase <at> acm.org> said:

    Mattias> 'regexp-opt' always generates a regexp preferring long matches. This
    Mattias> is undocumented, but useful enough that I would be surprised if this
    Mattias> property wasn't exploited (perhaps unknowingly) by callers. It's quite
    Mattias> natural: given a set of strings, surely the caller wants them all to be
    Mattias> candidates for a match, even if there is no following anchoring
    Mattias> pattern.

    Mattias> Thus, instead of 'unordered-or', define the operator in terms of long
    Mattias> matches: 'or-max' (working name) would work like 'or' but guarantee a
    Mattias> longest match, and only permit strings and 'or-max' forms as
    Mattias> arguments. Thus, the rx user gets all the benefits from 'regexp-opt'
    Mattias> in a composable way, without a need to sort the strings or otherwise
    Mattias> prepare them.

    Mattias> (The old 'or' behaviour always used 'regexp-opt' when possible, which
    Mattias> was very fragile: (or "a" "ab") would match "ab", but (or "a" "ab"
    Mattias> digit) would just match "a". 'or-max' is robust, without surprises.)

    Mattias> Of course, we should also guarantee the maximum-matching property of
    Mattias> regexp-opt. This is just a matter of documentation (and test); it does
    Mattias> not restrict optimisations as far as I can tell.

    Mattias> Again, I'm open to suggestions about a better name than 'or-max'.

or-greedy?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 22 Oct 2019 17:34:02 GMT) Full text and rfc822 format available.

Message #31 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 22 Oct 2019 10:33:40 -0700
On 10/22/19 8:14 AM, Mattias Engdegård wrote:
> 'regexp-opt' always generates a regexp preferring long matches. This is undocumented, but useful enough that I would be surprised if this property wasn't exploited (perhaps unknowingly) by callers. It's quite natural: given a set of strings, surely the caller wants them all to be candidates for a match, even if there is no following anchoring pattern.

Yes, the longstanding tradition is that regular expressions are greedy.

> Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments.

That's an odd restriction. I'm not sure it's a good idea to add an 
operator with such a restriction. That is, I know why the restriction is 
there (it's because of limitations in the Emacs regexp matcher), but 
it's not clear that users should have to know and understand these details.

Moreover, if greed is the longstanding tradition for regexp-opt, 
shouldn't plain "or" be greedy, to be consistent with other operators? 
That is true for POSIX regular expressions involving "|". For example, 
the shell command:

echo abbc |
awk '{n=split($0, a, /b|bb/); for (i=1;i<=n;i++) print a[i]}'

outputs the two lines "a" and "c" (not the three lines "a", "", and "c") 
because the "b|bb" matches greedily.

If it's too much trouble to make plain "or" greedy, I suggest just 
documenting it as possibly being greedy and possibly not (that is, 
document it as being unordered, even if it happens to be ordered now). 
This will give us more opportunity for optimization later.

More generally, surely it would be better to improve the underlying 
Emacs regular expression matcher to have a greedy "or", or a stingy 
"or", or whatever.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Wed, 23 Oct 2019 09:17:02 GMT) Full text and rfc822 format available.

Message #34 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 23 Oct 2019 11:15:47 +0200
22 okt. 2019 kl. 19.33 skrev Paul Eggert <eggert <at> cs.ucla.edu>:

>> Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments.
> 
> That's an odd restriction. I'm not sure it's a good idea to add an operator with such a restriction. That is, I know why the restriction is there (it's because of limitations in the Emacs regexp matcher), but it's not clear that users should have to know and understand these details.

The restriction is simple and easy to document. It is not necessary to know the underlying reason for it in order to use the construct effectively.

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?

Yes, I very much favour switching to a DFA engine; is there another way? Even then a backtracking engine would be needed for backrefs and other messy cases. However, that's a completely different amount of work. (Meanwhile, we have 'posix-string-match' etc for those who want greed at any cost.)
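
As a sketch of that last remark, the awk example quoted earlier in the thread can be replayed in Emacs Lisp by comparing `string-match' with `posix-string-match' on the same pattern. The helper `my-match-span' is hypothetical.

```elisp
(defun my-match-span (matcher regexp string)
  "Call MATCHER on REGEXP and STRING; return (START . END) or nil."
  (save-match-data
    (when (funcall matcher regexp string)
      (cons (match-beginning 0) (match-end 0)))))

;; Backtracking match: the first alternative that succeeds is used.
(my-match-span #'string-match "b\\|bb" "abbc")        ; => (1 . 2)

;; POSIX-style match: keeps looking for the longest alternative.
(my-match-span #'posix-string-match "b\\|bb" "abbc")  ; => (1 . 3)
```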

The problem that I'm trying to solve here is: how do we make it easy to match one of multiple strings --- keywords, say --- in rx? Currently, the answer is something like (regexp (regexp-opt my-keywords)), which doesn't integrate well with rx user definitions. In addition, the output of one regexp-opt cannot be used as input to another.

'or-max' would allow a user to say

(rx-define veggies (or-max "carrot" "tomato" "cucumber"))
(rx-define meats (or-max "beef" "chicken" "pork"))
... (rx (or-max veggies meats)) ...

and get a regexp that is guaranteed to be greedy, well-optimised as if all strings were passed to 'regexp-opt' at once, and robust: a small change won't change the behaviour radically, and the user won't have to game or second-guess the engine in order to produce the desired result.

If, in the future, 'or' becomes greedy, then 'or-max' will just be a synonym.
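
For comparison, the existing approach mentioned above, (regexp (regexp-opt my-keywords)), might look roughly like this. It is a sketch under assumptions: the variable and function names are hypothetical, and the word-boundary anchors are an illustrative addition, not part of the proposal.

```elisp
(defvar my-veggies '("carrot" "tomato" "cucumber"))
(defvar my-meats '("beef" "chicken" "pork"))

(defun my-food-regexp ()
  "Return a regexp matching any food word as a whole word."
  (rx-to-string
   `(seq word-start
         ;; All strings go to `regexp-opt' in one call, because the
         ;; output of one `regexp-opt' cannot feed another.
         (regexp ,(regexp-opt (append my-veggies my-meats)))
         word-end)
   t))

(string-match-p (my-food-regexp) "roast chicken")   ; => 6
(string-match-p (my-food-regexp) "beefy")           ; => nil
```

The keyword lists have to be kept as plain Lisp lists and merged by hand, which is the composability gap the proposed operator was meant to close.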

> If it's too much trouble to make plain "or" greedy, I suggest just documenting it as possibly being greedy and possibly not (that is, document it as being unordered, even if it happens to be ordered now). This will give us more opportunity for optimization later.

That would make rx strictly less useful than string regexps. That is why 'unordered-or' was a mistake: the unpredictability made it useless in many cases, and everyone would just have used regexp-opt (or skipped rx altogether).

It is desirable to have the same semantics for 'or' in rx and \| in string regexps; otherwise, translating and understanding become unnecessarily difficult.

We could say that 'or' and \| either match greedily or in left-to-right order. However, I'm not sure this solves any problem right now.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Wed, 23 Oct 2019 23:15:01 GMT) Full text and rfc822 format available.

Message #37 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 23 Oct 2019 16:14:45 -0700
On 10/23/19 2:15 AM, Mattias Engdegård wrote:

> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?

If that's the real problem, perhaps the name should be "or-tokens" or 
something like that, to help remind the reader of the limitations of the 
proposed operator: it's meant only for greedy tokenization and it isn't 
suited for regular expressions in general. A problem with the name 
"or-max" is that it implies a more-general functionality than the 
implementation really has.

What happens if you apply or-tokens to arguments that aren't strings or 
other or-tokens? Does rx diagnose this? I hope it does.

> We could say that 'or' and \| either match greedily or in left-to-right order. However, I'm not sure this solves any problem right now.

I was thinking of something more-compatible: we could say that \| is 
left-to-right (for users who need compatibility with regexp "|"), and 
that 'or' is not necessarily left-to-right (to make room for future 
extensions that make 'or' greedy, or more efficient, or both).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 01:57:01 GMT) Full text and rfc822 format available.

Message #40 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Drew Adams <drew.adams <at> oracle.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Mattias Engdegård
 <mattiase <at> acm.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: RE: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 23 Oct 2019 18:56:26 -0700 (PDT)
Without wanting to butt in here, and knowing
nothing about what rx offers...

Is there an identifiable subset of rx features
(operators, functions, thingies, or whatever
its composable pieces are called) that map
(even if not one-to-one) to regexp syntax
components?

If so, are the things in that subset identified
as such in the doc?

Just wondering.  Wondering, in part, whether
use of rx might indirectly help someone learn
about Emacs regexp syntax and behavior.

Not saying that's important.  Just thought it
might be of some use, or at least interesting.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 08:59:02 GMT) Full text and rfc822 format available.

Message #43 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 10:58:43 +0200
[Message part 1 (text/plain, inline)]
24 okt. 2019 kl. 01.14 skrev Paul Eggert <eggert <at> cs.ucla.edu>:
> 
>> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?
> 
> If that's the real problem, perhaps the name should be "or-tokens" or something like that, to help remind the reader of the limitations of the proposed operator: it's meant only for greedy tokenization and it isn't suited for regular expressions in general. A problem with the name "or-max" is that it implies a more-general functionality than the implementation really has.

'or-strings' then perhaps, since there is nothing really restricting it to 'tokens' (which is a bit hazardous terminology given that regexps are commonly used for tokenising). In particular, there is no delimiting; (or-max "IN" "OUT") will match the first part of "INSPECT", which may be unexpected of something ostensibly matching tokens.

On the other hand, 'or-strings' sort of precludes a future relaxation of the argument restriction.
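
The lack of delimiting can be sketched with `regexp-opt' standing in for the proposed operator; the PAREN argument `words' is the existing way to ask for word delimiters. The function name is hypothetical.

```elisp
(defun my-in-out-matches ()
  "Return match positions for three IN/OUT lookups, case-sensitively."
  (let ((case-fold-search nil))
    (list
     ;; No delimiters: the "IN" at the start of "INSPECT" matches.
     (string-match-p (regexp-opt '("IN" "OUT")) "INSPECT")
     ;; With word delimiters, "INSPECT" no longer matches...
     (string-match-p (regexp-opt '("IN" "OUT") 'words) "INSPECT")
     ;; ...while a standalone "IN" still does.
     (string-match-p (regexp-opt '("IN" "OUT") 'words) "IN OUT"))))

(my-in-out-matches)   ; => (0 nil 0)
```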

> What happens if you apply or-tokens to arguments that aren't strings or other or-tokens? Does rx diagnose this? I hope it does.

Yes, of course. Working patch attached (it still uses the name 'or-max').

'or-max' isn't a vital addition; it just seemed to fill a gap, after experience with traditional regexp usage. It clearly shouldn't be added on a whim. I wanted to get it in place for 27.1, but such a version rush has rarely resulted in good design.

> I was thinking of something more-compatible: we could say that \| is left-to-right (for users who need compatibility with regexp "|"), and that 'or' is not necessarily left-to-right (to make room for future extensions that make 'or' greedy, or more efficient, or both).

Sorry, by '\|' I meant the string regexp operator; I take it you propose separate semantics for the rx '|' and 'or' operators? Maybe we should worry about that if we ever get near the point of replacing the engine. There are other concerns, such as how capture groups are set (even if two branches match equally long texts).

I honestly don't think much would break if '\|' (in string regexps) became greedy overnight, but there is plenty of room to confuse the user if we introduce subtle distinctions between what has hitherto been perceived as synonyms.

[0003-Add-the-rx-or-max-operator.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 09:10:02 GMT) Full text and rfc822 format available.

Message #46 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 11:09:36 +0200
24 okt. 2019 kl. 03.56 skrev Drew Adams <drew.adams <at> oracle.com>:
> 
> Is there an identifiable subset of rx features
> (operators, functions, thingies, or whatever
> its composable pieces are called) that map
> (even if not one-to-one) to regexp syntax
> components?

Almost all of rx maps one-to-one to the string regexp syntax. The rx docs mention the corresponding string regexps for most forms.

In addition, rx is self-explaining in the sense that a user curious about what a particular expression means needs only to evaluate (rx SOMETHING) to get the translation. (There are external packages for going in the other direction.)
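
For instance, evaluating the following forms (chosen here purely for illustration) returns the corresponding string regexps:

```elisp
;; Evaluate an rx form to see the string regexp it translates to.
(rx line-start (one-or-more digit) ":")   ; => "^[[:digit:]]+:"
(rx (zero-or-more "ab"))                  ; => "\\(?:ab\\)*"
```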





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 09:18:02 GMT) Full text and rfc822 format available.

Message #49 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Phil Sainty <psainty <at> orcon.net.nz>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: Mattias Engdegård <mattiase <at> acm.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 22:17:52 +1300
On 24/10/19 2:56 PM, Drew Adams wrote:
> Is there an identifiable subset of rx features ... that map
> (even if not one-to-one) to regexp syntax components?

C-h f rx

(syntax SYNTAX)  Match a character with syntax SYNTAX, being one of:
  whitespace, punctuation, word, symbol, open-parenthesis,
  close-parenthesis, expression-prefix, string-quote,
  paired-delimiter, escape, character-quote, comment-start,
  comment-end, string-delimiter, comment-delimiter





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 14:25:01 GMT) Full text and rfc822 format available.

Message #52 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Drew Adams <drew.adams <at> oracle.com>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 37659 <at> debbugs.gnu.org
Subject: RE: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 07:24:01 -0700 (PDT)
> > Is there an identifiable subset of rx features
> > (operators, functions, thingies, or whatever
> > its composable pieces are called) that map
> > (even if not one-to-one) to regexp syntax
> > components?
> 
> Almost all of rx maps one-to-one to the string regexp syntax. The rx
> docs mention the corresponding string regexps for most forms.
> 
> In addition, rx is self-explaining in the sense that a user curious
> about what a particular expression means needs only to evaluate (rx
> SOMETHING) to get the translation. (There are external packages for
> going in the other direction.)

Yes, I knew the last part - rx returns a regexp
(which you can examine).

My suggestion was really about documenting the
correspondence between individual rx components
and regexp components.  If that's already done,
great.  Thx.






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 24 Oct 2019 14:34:01 GMT) Full text and rfc822 format available.

Message #55 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Drew Adams <drew.adams <at> oracle.com>
To: Phil Sainty <psainty <at> orcon.net.nz>
Cc: Mattias Engdegård <mattiase <at> acm.org>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 37659 <at> debbugs.gnu.org
Subject: RE: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 07:32:59 -0700 (PDT)
> > Is there an identifiable subset of rx features ... that map
> > (even if not one-to-one) to regexp syntax components?
> 
> C-h f rx
> 
> (syntax SYNTAX)  Match a character with syntax SYNTAX, being one of:
>   whitespace, punctuation, word, symbol, open-parenthesis,
>   close-parenthesis, expression-prefix, string-quote,
>   paired-delimiter, escape, character-quote, comment-start,
>   comment-end, string-delimiter, comment-delimiter

Yes, that's fine for char syntax classes.  It's
good that their correspondences are listed.

But for, say, `line-start' (aka `bol') there is a
verbal description but no mention of the Elisp
regexp syntax that corresponds:

  ‘line-start’, ‘bol’
     matches the empty string, but only at the
     beginning of a line in the text being matched

That's the kind of thing I was suggesting.  It would
be helpful, I think, to mention (somewhere) that the
regexp syntax for this is "^".

Same thing for the other constructs (`string-start'
and all the rest).

I'm using Emacs 26.3.  I didn't find anything beyond
the doc string - nothing in the Emacs or Elisp manual
(which is OK).
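
As a rough sketch of the kind of correspondence table being asked for, the mapping for a few anchors can be generated with `rx-to-string'. The variable and function names are hypothetical, and the symbol list is only a sample.

```elisp
(defvar my-rx-anchors
  '(line-start line-end string-start string-end word-boundary))

(defun my-rx-correspondences (symbols)
  "Return an alist mapping each rx symbol in SYMBOLS to its regexp string."
  ;; NO-GROUP is t so that no bracketing is added around the result.
  (mapcar (lambda (sym) (cons sym (rx-to-string sym t)))
          symbols))

(my-rx-correspondences my-rx-anchors)
;; => ((line-start . "^") (line-end . "$") (string-start . "\\`")
;;     (string-end . "\\'") (word-boundary . "\\b"))
```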




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Sun, 27 Oct 2019 11:54:01 GMT) Full text and rfc822 format available.

Message #58 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Sun, 27 Oct 2019 12:53:05 +0100
[Message part 1 (text/plain, inline)]
An observation is that 'or-max' cannot currently be defined by the user, because there is no way to expand rx forms explicitly. One way to fill that hole is to add the function

  (rx-expand-definitions RX-FORM)

which would expand RX-FORM until it no longer is a user-defined form.
This would permit or-max to be defined as

(rx-define or-max (&rest forms)
  (eval `(regexp ,(regexp-opt (or-max-strings (list forms))))))

(defun or-max-strings (args)
  (mapcan (lambda (item)
            (pcase item
              ((pred stringp) (list item))
              (`(or-max . ,rest) (or-max-strings rest))
              (_ (error "Illegal `or-max' argument: %S" item))))
          (mapcar #'rx-expand-definitions args)))

Of course, if the 'or-max' operator is generally useful, it would probably still make sense to define it as a primitive.
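
The flattening step can be tried on its own, without the proposed `rx-expand-definitions'. The following is a simplified sketch: `my-or-max-strings' and `my-or-max-regexp' are hypothetical names, and user-defined rx forms are not expanded, so only literal strings and nested (or-max ...) lists are accepted.

```elisp
(defun my-or-max-strings (args)
  "Flatten ARGS, a list of strings and nested (or-max ...) forms."
  (mapcan (lambda (item)
            (pcase item
              ((pred stringp) (list item))
              (`(or-max . ,rest) (my-or-max-strings rest))
              (_ (error "Invalid `or-max' argument: %S" item))))
          args))

(defun my-or-max-regexp (&rest args)
  "Return a longest-match regexp for the strings in ARGS."
  (regexp-opt (my-or-max-strings args)))

(my-or-max-strings '("a" (or-max "b" "c") "d"))   ; => ("a" "b" "c" "d")

;; The combined regexp prefers the longest candidate.
(let ((re (my-or-max-regexp "a" '(or-max "ab" "abc"))))
  (string-match re "abcd")
  (match-string 0 "abcd"))                        ; => "abc"
```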

[0001-Add-rx-expand-definitions.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 11 Feb 2020 12:58:01 GMT) Full text and rfc822 format available.

Message #61 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Phil Sainty <psainty <at> orcon.net.nz>,
 Eli Zaretskii <eliz <at> gnu.org>
Cc: 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 13:57:27 +0100
[Message part 1 (text/plain, inline)]
22 okt. 2019 kl. 19.33 skrev Paul Eggert <eggert <at> cs.ucla.edu>:

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?

Having second thoughts, I've come to believe that Paul may have been right after all. We might just as well let plain 'or' (alias '|') match as much as possible when it is able to do so. In particular, we should guarantee that this will happen when all arguments are strings, as used to be the case.

Initially I thought it was a bug that (or "a" "ab") was optimised into "ab?" on the grounds that this made the behaviour unpredictable: when matching the string "abc", (or "a" "ab") matched "ab", whereas (or "a" "ab" space) would match "a". However, the current 'fixed' code isn't necessarily more useful.

Since the change was introduced in Emacs 27 which has not yet been released, I suggest the attached patch for emacs-27. It reverts the use of regexp-opt with KEEP-ORDER = t. What do you think? It would solve the problem without introducing new constructs, and without running the risk of introducing subtle errors in existing rx expressions.

(In fact, if we do not do this in Emacs 27, we'd have to add a NEWS entry to warn users about the change.)

A further improvement would be to ensure that nested all-string 'or' forms would have the same property, and that expansion of user-defined forms would be transparent. In other words, that

 (rx-let ((x (or "abc" "de")))
   (rx (or "a" x (or "ab" "def"))))

would be equivalent to

 (rx (or "abc" "ab" "a" "def" "de"))

I'll prepare a patch for this QoI improvement, but the attached patch should be required no matter what.

[0001-rx-Use-longest-match-for-all-string-or-forms-bug-376.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 11 Feb 2020 15:44:01 GMT) Full text and rfc822 format available.

Message #64 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: psainty <at> orcon.net.nz, eggert <at> cs.ucla.edu, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 17:43:41 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 11 Feb 2020 13:57:27 +0100
> Cc: 37659 <at> debbugs.gnu.org
> 
> Since the change was introduced in Emacs 27 which has not yet been released, I suggest the attached patch for emacs-27. It reverts the use of regexp-opt with KEEP-ORDER = t. What do you think? It would solve the problem without introducing new constructs, and without running the risk of introducing subtle errors in existing rx expressions.

Can't say I'm happy with these last-minute experiments, but okay.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Tue, 11 Feb 2020 19:19:02 GMT) Full text and rfc822 format available.

Message #67 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: psainty <at> orcon.net.nz, Paul Eggert <eggert <at> cs.ucla.edu>,
 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 20:17:50 +0100
[Message part 1 (text/plain, inline)]
11 feb. 2020 kl. 16.43 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Can't say I'm happy with these last-minute experiments, but okay.

Thanks, and I think it's actually a lesser experiment than status quo ante.
I'm allowing for more comments before pushing it; meanwhile, here is the follow-up patch mentioned earlier.

[0001-rx-Improve-or-compositionality.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Wed, 12 Feb 2020 00:53:01 GMT) Full text and rfc822 format available.

Message #70 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>,
 Eli Zaretskii <eliz <at> gnu.org>
Cc: psainty <at> orcon.net.nz, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 16:52:37 -0800
On 2/11/20 11:17 AM, Mattias Engdegård wrote:
>> Can't say I'm happy with these last-minute experiments, but okay.
> Thanks, and I think it's actually a lesser experiment than status quo ante.

I agree with both of you.

I assume the followon patch "rx: Improve 'or' compositionality" is for 
the master branch. I didn't look at it carefully but the basic idea is 
sound.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Wed, 12 Feb 2020 11:23:01 GMT) Full text and rfc822 format available.

Message #73 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Phil Sainty <psainty <at> orcon.net.nz>, Eli Zaretskii <eliz <at> gnu.org>,
 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 12 Feb 2020 12:22:06 +0100
12 feb. 2020 kl. 01.52 skrev Paul Eggert <eggert <at> cs.ucla.edu>:

> I assume the followon patch "rx: Improve 'or' compositionality" is for the master branch.

Right -- while it would make sense for emacs-27, as user-defined forms were introduced in that version, it is not strictly necessary and could well be done on master.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 13 Feb 2020 18:39:02 GMT) Full text and rfc822 format available.

Message #76 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Phil Sainty <psainty <at> orcon.net.nz>, Eli Zaretskii <eliz <at> gnu.org>,
 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 13 Feb 2020 19:38:41 +0100
Now the regexp-opt KEEP-ORDER argument no longer serves any purpose; it was added for use by rx, and has no obvious use elsewhere. It could safely be removed to save some weight, unless you prefer it be kept as scar tissue.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 13 Feb 2020 18:51:02 GMT) Full text and rfc822 format available.

Message #79 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Phil Sainty <psainty <at> orcon.net.nz>, Eli Zaretskii <eliz <at> gnu.org>,
 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 13 Feb 2020 10:50:41 -0800
On 2/13/20 10:38 AM, Mattias Engdegård wrote:
> Now the regexp-opt KEEP-ORDER argument no longer serves any purpose; it was added for use by rx, and has no obvious use elsewhere. It could safely be removed to save some weight, unless you prefer it be kept as scar tissue.

Simplest would be to remove it in Emacs 27, and merge that change into 
master.

If that's too drastic, we can mark it as deprecated in Emacs 27 (though 
it is odd for a release to add a feature that is immediately 
deprecated), and remove it in the master branch.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 13 Feb 2020 19:17:02 GMT) Full text and rfc822 format available.

Message #82 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Phil Sainty <psainty <at> orcon.net.nz>, Eli Zaretskii <eliz <at> gnu.org>,
 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 13 Feb 2020 20:16:28 +0100
[Message part 1 (text/plain, inline)]
13 feb. 2020 kl. 19.50 skrev Paul Eggert <eggert <at> cs.ucla.edu>:

> Simplest would be to remove it in Emacs 27, and merge that change into master.

Right, introducing it already deprecated doesn't make sense at all. I'll do it in emacs-27 if Eli doesn't mind too much (patch attached).

[0001-Remove-the-optional-KEEP-ORDER-argument-to-regexp-op.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 13 Feb 2020 19:31:01 GMT) Full text and rfc822 format available.

Message #85 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: psainty <at> orcon.net.nz, eggert <at> cs.ucla.edu, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 13 Feb 2020 21:30:39 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 13 Feb 2020 20:16:28 +0100
> Cc: Eli Zaretskii <eliz <at> gnu.org>, Phil Sainty <psainty <at> orcon.net.nz>,
>         37659 <at> debbugs.gnu.org
> 
> > Simplest would be to remove it in Emacs 27, and merge that change into master.
> 
> Right, introducing it already deprecated doesn't make sense at all. I'll do it in emacs-27 if Eli doesn't mind too much (patch attached).

Fine with me, thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Thu, 13 Feb 2020 22:24:02 GMT) Full text and rfc822 format available.

Message #88 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: psainty <at> orcon.net.nz, eggert <at> cs.ucla.edu, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 13 Feb 2020 23:23:01 +0100
13 feb. 2020 kl. 20.30 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Fine with me, thanks.

Thank you, pushed.

Now only the 'compositionality' patch remains. As noted it could be done on master, but since it is motivated by user-defined forms which were introduced in Emacs 27, I'd rather like it to be done in that branch.

It simply seems a bit incomplete otherwise. Users read about the longest-match guarantee of 'or', and of user-definitions, and look for a way of combining the two. Perhaps they try

(rx-let ((arith-op (or "+" "-" "*" "/"))
         (assign-op (or "=" "+=" "-=" "*=" "/="))
         (op (or arith-op assign-op)))
...)

which doesn't quite work for matching "+=", say.
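
A runnable version of the example above, as a sketch only. It assumes Emacs 27.1 or later, where `rx-let` exists; the function name `demo-match-op` and the `bos` anchor are additions for illustration and are not part of the original message. With the compositionality patch applied, the call should return "+="; without it, the match would be expected to stop at "+".

```elisp
(require 'rx)

;; Assumption: Emacs 27.1 or later (rx-let, nested all-string `or' flattening).
(defun demo-match-op (string)
  "Return the operator matched at the start of STRING, or nil."
  (rx-let ((arith-op (or "+" "-" "*" "/"))
           (assign-op (or "=" "+=" "-=" "*=" "/="))
           (op (or arith-op assign-op)))
    (when (string-match (rx bos op) string)
      (match-string 0 string))))

(demo-match-op "+= 1")
```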






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Fri, 14 Feb 2020 07:46:02 GMT) Full text and rfc822 format available.

Message #91 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: psainty <at> orcon.net.nz, eggert <at> cs.ucla.edu, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Fri, 14 Feb 2020 09:45:34 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 13 Feb 2020 23:23:01 +0100
> Cc: eggert <at> cs.ucla.edu, psainty <at> orcon.net.nz, 37659 <at> debbugs.gnu.org
> 
> Now only the 'compositionality' patch remains. As noted it could be done on master, but since it is motivated by user-defined forms which were introduced in Emacs 27, I'd rather like it to be done in that branch.
> 
> It simply seems a bit incomplete otherwise.

The important question is: is "incomplete" really so bad here.  We
have a few features that need to be completed in future releases, so
this one isn't alone.

I don't have a strong opinion either way.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Fri, 14 Feb 2020 16:16:02 GMT) Full text and rfc822 format available.

Message #94 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>, Mattias Engdegård
 <mattiase <at> acm.org>
Cc: psainty <at> orcon.net.nz, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Fri, 14 Feb 2020 08:15:43 -0800
On 2/13/20 11:45 PM, Eli Zaretskii wrote:
> I don't have a strong opinion either way.
How about this idea: if the compositionality patch affects only forms introduced 
in Emacs 27, then it's OK to put it into Emacs 27. If it affects forms that have 
been around since Emacs 26, it might be safer to put the patch into the master 
branch. (I haven't looked at the patch in detail.)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Fri, 14 Feb 2020 20:50:01 GMT) Full text and rfc822 format available.

Message #97 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: psainty <at> orcon.net.nz, Eli Zaretskii <eliz <at> gnu.org>, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Fri, 14 Feb 2020 21:49:01 +0100
14 feb. 2020 kl. 17.15 skrev Paul Eggert <eggert <at> cs.ucla.edu>:

> How about this idea: if the compositionality patch affects only forms introduced in Emacs 27, then it's OK to put it into Emacs 27. If it affects forms that have been around since Emacs 26, it might be safer to put the patch into the master branch.

It is a reasonable criterion. In theory the patch might affect existing code, since it enables maximal matching for nested 'or' trees of string literals. The expression

(rx (or (or "a" "abc") (or "ab" "abcd")))

will currently not match the whole string "abcd", but with the patch, it would, as if flattened. The reasoning is that this is more useful and composable, and almost always what the user wants.
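
To make the comparison concrete, here is a small sketch that contrasts the nested `rx` form with a hand-written regexp that preserves the order of the branches. It assumes an Emacs that includes the patch (27.1 or later); the function name `demo-nested-or-matches` is hypothetical. On such a version the first element of the result should be "abcd", while the ordered regexp only matches "a".

```elisp
(require 'rx)

;; Assumption: Emacs 27.1 or later, with nested all-string `or' forms flattened.
(defun demo-nested-or-matches (string)
  "Return a list of two matches in STRING: rx nested `or', then ordered regexp."
  (list
   ;; Nested all-string `or' forms, as in the expression above.
   (and (string-match (rx (or (or "a" "abc") (or "ab" "abcd"))) string)
        (match-string 0 string))
   ;; Hand-written regexp that keeps the left-to-right order of the branches.
   (and (string-match "\\(?:a\\|abc\\)\\|\\(?:ab\\|abcd\\)" string)
        (match-string 0 string))))

(demo-nested-or-matches "abcd")
```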

Whether this makes a difference for actually existing code is anyone's guess.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37659; Package emacs. (Sun, 01 Mar 2020 10:10:02 GMT) Full text and rfc822 format available.

Message #100 received at 37659 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: psainty <at> orcon.net.nz, Eli Zaretskii <eliz <at> gnu.org>, 37659 <at> debbugs.gnu.org
Subject: Re: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Sun, 1 Mar 2020 11:09:37 +0100
I've belatedly made up my mind about this and now firmly believe that the rx compositionality patch belongs in emacs-27, and have therefore pushed it to that branch. Sorry about the long deliberation.





Added tag(s) fixed. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Wed, 04 Mar 2020 17:43:02 GMT) Full text and rfc822 format available.

bug marked as fixed in version 27.1, send any further explanations to 37659 <at> debbugs.gnu.org and Mattias Engdegård <mattiase <at> acm.org> Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Wed, 04 Mar 2020 17:43:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 02 Apr 2020 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 23 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.