GNU bug report logs - #1877
Request: Regular expressions that can match Unicode general categories

Package: emacs

Reported by: Derick Eddington <derick.eddington <at> gmail.com>

Date: Mon, 12 Jan 2009 20:45:02 UTC

Severity: wishlist

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 1877 in the body.
You can then email your comments to 1877 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#1877; Package emacs. (Mon, 12 Jan 2009 20:45:02 GMT)

Acknowledgement sent to Derick Eddington <derick.eddington <at> gmail.com>:
New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 12 Jan 2009 20:45:03 GMT)

Message #5 received at submit <at> emacsbugs.donarmstrong.com:

From: Derick Eddington <derick.eddington <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Request: Regular expressions that can match Unicode general
 categories
Date: Mon, 12 Jan 2009 12:38:12 -0800

A new Scheme major mode I've made [1] requires regular expressions that
can match characters by their Unicode general categories.  It seems
Emacs regular expressions do not provide a way to do that directly (I'm
using GNU Emacs 23.0.60.1) (I couldn't find anything about it in the
Emacs documentation, emacswiki.org, or by asking on
help-gnu-emacs <at> gnu.org or in that list's archives).  So currently I
pre-compute character sets for the needed general categories (using
`get-char-code-property') and place these in their positions in the
larger regular expressions.  However, including character sets for every
general category I need makes the regular expressions too large for
Emacs, which signals an error when it tries to use them (some of the
sets are pretty big), so currently I don't support all of the required
categories.  Another issue is that these character sets are duplicated
in different regular expressions, and since they're so large this bloats
the code.  I also suspect that matching against character sets this
large is not very time-efficient.

If Emacs regular expressions had some construct, similar to the existing
`\cC' one, that matched a character by its general category, I think
that would solve all the above issues nicely.  PLT Scheme regular
expressions have this ability [2].  

[1]
https://code.launchpad.net/~derick-eddington/scheme-mode/derick-.emacs.d
[2] http://docs.plt-scheme.org/reference/regexp.html

Thank you for your work on Emacs and for your time,

-- 
: Derick
----------------------------------------------------------------
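
A minimal sketch of the workaround described in the report above, not the
mode's actual code: it walks a code point range, keeps the characters whose
`general-category' property (as returned by `get-char-code-property') is in
a given list, and turns them into a bracket expression.  The function names
`my-chars-in-categories' and `my-category-regexp' are hypothetical, and
`regexp-opt-charset' is used here only as one convenient way to build the
bracket expression.

```elisp
(require 'regexp-opt)

;; Hypothetical helper: not taken from the reporter's Scheme mode.
(defun my-chars-in-categories (categories from to)
  "Return the characters in FROM..TO whose general category is in CATEGORIES.
CATEGORIES is a list of symbols such as (Lu Ll)."
  (let (chars)
    (dotimes (i (1+ (- to from)))
      (let ((ch (+ from i)))
        (when (memq (get-char-code-property ch 'general-category)
                    categories)
          (push ch chars))))
    (nreverse chars)))

;; Hypothetical helper: builds a regexp from the precomputed character list.
(defun my-category-regexp (categories from to)
  "Return a regexp matching one character of CATEGORIES within FROM..TO."
  (regexp-opt-charset (my-chars-in-categories categories from to)))
```

For example, (my-category-regexp '(Lu) 0 #xFF) gives a regexp for the
upper-case letters of Latin-1.  Bind `case-fold-search' to nil when matching
with it, since case folding would otherwise let lower-case letters match
too.  Widening the range to all of Unicode and listing many categories is
what makes such regexps grow to the sizes the report describes.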

Severity set to `wishlist' from `normal'. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> emacsbugs.donarmstrong.com. (Mon, 12 Jan 2009 22:15:03 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#1877; Package emacs. (Mon, 30 Sep 2019 07:46:02 GMT)

Message #10 received at 1877 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Derick Eddington <derick.eddington <at> gmail.com>
Cc: 1877 <at> debbugs.gnu.org
Subject: Re: bug#1877: Request: Regular expressions that can match Unicode
 general categories
Date: Mon, 30 Sep 2019 09:45:15 +0200

Derick Eddington <derick.eddington <at> gmail.com> writes:

> A new Scheme major mode I've made [1] requires regular expressions that
> can match characters by their Unicode general categories.  It seems
> Emacs regular expressions do not provide a way to do that directly (I'm
> using GNU Emacs 23.0.60.1)

(I'm going through old bug reports that unfortunately didn't get any
response at the time.)

I'm not quite sure what Unicode general categories you're referring to,
but the Emacs regexp matcher has gained a bunch of categories in the ten
years since you made the request.

Are the categories below what you were thinking of?

‘[:print:]’
     This matches any printing character—either whitespace, or a graphic
     character matched by ‘[:graph:]’.
‘[:punct:]’
     This matches any punctuation character.  (At present, for multibyte
     characters, it matches anything that has non-word syntax.)
‘[:space:]’
     This matches any character that has whitespace syntax (*note Syntax
     Class Table::).
‘[:upper:]’
     This matches any upper-case letter, as determined by the current
     case table (*note Case Tables::).  If ‘case-fold-search’ is
     non-‘nil’, this also matches any lower-case letter.
‘[:word:]’
     This matches any character that has word syntax (*note Syntax Class
     Table::).

(etc)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 30 Sep 2019 07:46:03 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#1877; Package emacs. (Mon, 30 Sep 2019 08:46:02 GMT)

Message #15 received at 1877 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: derick.eddington <at> gmail.com, 1877 <at> debbugs.gnu.org
Subject: Re: bug#1877: Request: Regular expressions that can match Unicode
 general categories
Date: Mon, 30 Sep 2019 11:45:14 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Date: Mon, 30 Sep 2019 09:45:15 +0200
> Cc: 1877 <at> debbugs.gnu.org
> 
> Derick Eddington <derick.eddington <at> gmail.com> writes:
> 
> > A new Scheme major mode I've made [1] requires regular expressions that
> > can match characters by their Unicode general categories.  It seems
> > Emacs regular expressions do not provide a way to do that directly (I'm
> > using GNU Emacs 23.0.60.1)
> 
> (I'm going through old bug reports that unfortunately didn't get any
> response at the time.)
> 
> I'm not quite sure what Unicode general categories you're referring to,
> but the Emacs regexp matcher has gained a bunch of categories in the ten
> years since you made the request.
> 
> Are the categories below what you were thinking of?
> 
> ‘[:print:]’
>      This matches any printing character—either whitespace, or a graphic
>      character matched by ‘[:graph:]’.
> ‘[:punct:]’
>      This matches any punctuation character.  (At present, for multibyte
>      characters, it matches anything that has non-word syntax.)
> ‘[:space:]’
>      This matches any character that has whitespace syntax (*note Syntax
>      Class Table::).
> ‘[:upper:]’
>      This matches any upper-case letter, as determined by the current
>      case table (*note Case Tables::).  If ‘case-fold-search’ is
>      non-‘nil’, this also matches any lower-case letter.
> ‘[:word:]’
>      This matches any character that has word syntax (*note Syntax Class
>      Table::).

No, he means the categories described in the node "Character
Properties" of the ELisp manual.

We don't yet have full support for the Unicode Regular Expressions, as
specified in UTS#18.  In particular, see

  http://unicode.org/reports/tr18/#General_Category_Property

for General Category regexp specs.

It is not clear to me which categories are of interest here.  Some of
them are nowadays definitely available indirectly via the classes
mentioned above (they weren't available in Emacs 23 when the bug was
filed).  Maybe the OP could provide an explicit list of categories
needed for this Scheme mode, together with their required usage in
this mode.  Looking at R6RS sec 4.2.1, all I see is "whitespace"
(which we provide via [:space:]), "letter" (provided by [:alpha:]),
"digit" (provided by [:digit:]), and "intraline whitespace" (provided
by [:blank:]).  If this is all, then we have all the required support
now.
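
As a rough illustration of the classes named above, assuming a recent
Emacs: a class is only valid inside a bracket expression, so it is written
with doubled brackets, e.g. [[:alpha:]].  The helper `my-all-match-p' is a
hypothetical name; it binds `case-fold-search' to nil and uses the standard
syntax table so that [:space:], which depends on syntax, behaves
predictably regardless of the current buffer's mode.

```elisp
;; Hypothetical helper for trying out character classes.
(defun my-all-match-p (class string)
  "Return t if every character of STRING matches the class named CLASS.
CLASS is a class name such as \"alpha\"; the regexp used is [[:CLASS:]]+
anchored at both ends of STRING."
  (let ((case-fold-search nil))
    (with-syntax-table (standard-syntax-table)
      (and (string-match-p (format "\\`[[:%s:]]+\\'" class) string)
           t))))

(my-all-match-p "alpha" "abc")   ; letters
(my-all-match-p "digit" "2009")  ; digits
(my-all-match-p "blank" " \t")   ; horizontal (intraline) whitespace
(my-all-match-p "space" " \n")   ; characters with whitespace syntax
```

Note that, per the Emacs manual, [:digit:] matches only `0' through `9',
so it is narrower than the Unicode Nd general category; whether that
matters depends on which characters the mode actually has to recognize.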

Removed tag(s) moreinfo. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Thu, 16 Jan 2020 14:09:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#1877; Package emacs. (Sun, 14 Nov 2021 06:29:01 GMT)

Message #20 received at 1877 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: derick.eddington <at> gmail.com, 1877 <at> debbugs.gnu.org
Subject: Re: bug#1877: Request: Regular expressions that can match Unicode
 general categories
Date: Sun, 14 Nov 2021 07:28:06 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> It is not clear to me which categories are of interest here.  Some of
> them are nowadays definitely available indirectly via the classes
> mentioned above (they weren't available in Emacs 23 when the bug was
> filed).  Maybe the OP could provide an explicit list of categories
> needed for this Scheme mode, together with their required usage in
> this mode.  Looking at R6RS sec 4.2.1, all I see is "whitespace"
> (which we provide via [:space:]), "letter" (provided by [:alpha:]),
> "digit" (provided by [:digit:]), and "intraline whitespace" (provided
> by [:blank:]).  If this is all, then we have all the required support
> now.

There was no response here (in two years), so I'm guessing that we have
the categories required, and I'm closing this bug report.  If there are
any further categories that would be useful to have added, please
respond to the debbugs address and we'll reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Bug closed; send any further explanations to 1877 <at> debbugs.gnu.org and Derick Eddington <derick.eddington <at> gmail.com>. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 14 Nov 2021 06:29:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 12 Dec 2021 12:24:10 GMT)

