GNU bug report logs - #37849
composable character alternatives in rx

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Mon, 21 Oct 2019 10:25:01 UTC

Severity: normal

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37849 in the body.
You can then email your comments to 37849 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Mon, 21 Oct 2019 10:25:02 GMT)

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 21 Oct 2019 10:25:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: composable character alternatives in rx
Date: Mon, 21 Oct 2019 12:24:21 +0200
Now that rx is user-extendible, some holes are showing. Example (from python.el):

      (simple-operator      . ,(rx (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%)))
      ;; FIXME: rx should support (not simple-operator).
      (not-simple-operator  . ,(rx
                                (not
                                 (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%))))

(This code uses the old rx-constituents mechanism, but the point applies equally to new-style definitions.)
More generally, there is currently no way to:

(1) Get the complement of a defined (any ...) form
(2) Get the union of two defined (any ...) forms
(3) Get the intersection of two defined (not (any ...)) forms

(1), which the example above was about, could be solved by expanding definitions inside 'not'. This is a step away from the principle that user-defined things are only allowed where general rx forms are, but perhaps tolerable. Proposed patch attached.
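
To make (1) concrete, here is a minimal sketch of what expanding definitions inside 'not' allows. It assumes an Emacs in which that change is present (Emacs 27 or later, as far as I know). The names my-simple-operator and my-not-simple-operator-re are hypothetical and not part of python.el.

```elisp
(require 'rx)

;; Hypothetical definition, mirroring the python.el character set above.
(rx-define my-simple-operator
  (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%))

;; With definitions expanded inside `not', the complement no longer
;; has to repeat the character list.
(defconst my-not-simple-operator-re
  (rx (not my-simple-operator)))

(string-match-p my-not-simple-operator-re "+")   ; nil: "+" is an operator
(string-match-p my-not-simple-operator-re "a")   ; 0: "a" is not an operator
```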

(2) can be solved by expanding definitions inside 'any', and allowing 'any' inside 'any' (flattening). Not sure I like this.

An alternative is to ensure that (or (any X) (any Y)) -> (any X Y), but then we either need to allow 'or' inside 'not', or add an intersection operator:

  (intersect (not (any X)) (not (any Y))) -> (not (any X Y))

We could also make 'not' variadic, turning it into complement-of-union:

  (not (any A) (any B)) -> (not (any A B))

Olin Shivers's SRE has a complete and closed set of operations on character sets (https://scsh.net/docu/post/sre.html). That would be principled and perhaps useful, but difficult to do fully in rx because not all such expressions can be rendered into Emacs regexps. Nothing prevents us from making a partial implementation, however.

[0001-Expand-rx-definitions-inside-not.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Sun, 27 Oct 2019 09:18:02 GMT)

Message #8 received at 37849 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: 37849 <at> debbugs.gnu.org
Subject: bug#37849: composable character alternatives in rx 
Date: Sun, 27 Oct 2019 10:17:40 +0100
Expansion inside (not ...) should be uncontroversial; now pushed (cbd439e785).

Character set operators (union, intersection, difference) would be useful. Consider:

(rx-define ident-chars (any "a-zA-Z0-9"))
(rx-define operator-chars (any ?+ ?- ?* ?/ ?< ?> ?=))

There is then currently no way to form the set of characters that excludes both the above sets.
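
For reference, a sketch of how this case can be written once character set operators exist. It assumes the syntax this report eventually settled on (the complement of an 'or' of charsets, as described in the final message) and an Emacs that includes it. The name my-other-char-re is hypothetical.

```elisp
(require 'rx)

(rx-define ident-chars (any "a-zA-Z0-9"))
(rx-define operator-chars (any ?+ ?- ?* ?/ ?< ?> ?=))

;; Complement of the union: any character that is in neither set.
(defconst my-other-char-re
  (rx (not (or ident-chars operator-chars))))

(string-match-p my-other-char-re "ab+ c")   ; 3: the space is in neither set
(string-match-p my-other-char-re "x9+")     ; nil: every character is in a set
```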






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Fri, 06 Dec 2019 21:59:02 GMT)

Message #11 received at 37849 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: 37849 <at> debbugs.gnu.org
Subject: Re: bug#37849: composable character alternatives in rx  
Date: Fri, 6 Dec 2019 22:58:46 +0100
This patch adds `union' and `intersection' to rx. They both take zero or more charsets as arguments. A charset is either an `any' form that does not contain character classes, a `union' or `intersection' form, or a `not' form with a charset argument.

Example:

(rx (union (any "a-f") (any "b-m")))
=> "[a-m]"

(rx (intersection (any "a-f") (any "b-m")))
=> "[b-f]"

The character class limitation stems from the inability to complement or intersect classes in general. It would be possible to partially lift this restriction for `union'; it is clear that

(rx (union (any "ab" space) (any "bc" space digit)))
=> "[abc[:space:][:digit:]]"

but it makes the facility harder to explain to the user in a way that makes sense. Still, it could be a future extension.

A `difference' operator was not included but could be added; it is trivially defined in rx as

(rx-define difference (a b)
  (intersection a (not b)))
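
A small usage sketch of that definition, assuming an Emacs where `intersection' is available in rx (Emacs 27 or later, to my knowledge). The name my-consonant-re is hypothetical, and no exact regexp string is claimed here, only the matching behaviour.

```elisp
(require 'rx)

;; The definition from the message above: A minus B is A intersected
;; with the complement of B.
(rx-define difference (a b)
  (intersection a (not b)))

;; Lower-case ASCII letters except the vowels.
(defconst my-consonant-re
  (rx (difference (any "a-z") (any "aeiou"))))

(string-match-p my-consonant-re "aeix")   ; 3: "x" is the first consonant
(string-match-p my-consonant-re "e")      ; nil: vowels are excluded
```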

The names `union' and `intersection' are verbose, but they should be used rarely enough that descriptive names are preferable.
SRE, from which the concept was taken, uses `|' and `&' respectively, `~' for complement, and `-' for difference.

[0001-Add-union-and-intersection-to-rx-bug-37849.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Mon, 09 Dec 2019 11:05:02 GMT)

Message #14 received at 37849 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: 37849 <at> debbugs.gnu.org
Cc: Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#37849: composable character alternatives in rx 
Date: Mon, 9 Dec 2019 12:04:40 +0100
Eli, as a matter of protocol: assuming the union/intersection patch meets no opposition, can it be pushed to master? It is self-contained and should not affect anything outside rx.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Mon, 09 Dec 2019 13:37:02 GMT)

Message #17 received at 37849 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37849 <at> debbugs.gnu.org
Subject: Re: bug#37849: composable character alternatives in rx
Date: Mon, 09 Dec 2019 15:36:15 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 9 Dec 2019 12:04:40 +0100
> Cc: Eli Zaretskii <eliz <at> gnu.org>
> 
> Eli, as a matter of protocol: assuming the union/intersection patch meets no opposition, can it be pushed to master? It is self-contained and should not affect anything outside rx.

It's a new feature, so yes, assuming that you've verified it passes
all the tests and cannot possibly interfere with any existing code.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37849; Package emacs. (Tue, 10 Dec 2019 21:40:02 GMT)

Message #20 received at 37849 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37849 <at> debbugs.gnu.org
Subject: Re: bug#37849: composable character alternatives in rx 
Date: Tue, 10 Dec 2019 22:39:35 +0100
Thank you, now pushed.





Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Fri, 13 Dec 2019 12:37:02 GMT)

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Fri, 13 Dec 2019 12:37:02 GMT)

Message #25 received at 37849-done <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 37849-done <at> debbugs.gnu.org
Subject: Re: bug#37849: composable character alternatives in rx 
Date: Fri, 13 Dec 2019 13:35:42 +0100
As suggested by Stefan Monnier, 'union' was replaced with plain 'or' for character sets as well.

A minor usability improvement has been pushed to master as well: characters and single-char strings no longer have to be wrapped in (any...), so (not (any ?a)) can now be written (not ?a).
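
A brief sketch of both changes together, assuming an Emacs that includes them (Emacs 27 or later, as far as I know). The constant names are hypothetical.

```elisp
(require 'rx)

;; A bare character is accepted where (any ?a) used to be required.
(defconst my-not-a-re (rx (not ?a)))

;; `or' combines charsets inside `not'; characters, single-char strings
;; and (any ...) forms can be mixed.
(defconst my-not-a-b-digit-re
  (rx (not (or ?a "b" (any "0-9")))))

(string-match-p my-not-a-re "aab")            ; 2: first character that is not "a"
(string-match-p my-not-a-b-digit-re "ab1c")   ; 3: "c" is outside the set
```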





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 11 Jan 2020 12:24:05 GMT)
