GNU bug report logs -
#65726
29.1.50; Crash in regexp engine
Previous Next
Reported by: martin rudalics <rudalics <at> gmx.at>
Date: Mon, 4 Sep 2023 07:48:02 UTC
Severity: normal
Found in version 29.1.50
Fixed in version 30.1
Done: Stefan Kangas <stefankangas <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 65726 in the body.
You can then email your comments to 65726 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 07:48:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
martin rudalics <rudalics <at> gmx.at>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Mon, 04 Sep 2023 07:48:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
With emacs -Q load the attached file elinfo.el. Now type
C-h S split-window RET
C-x o
C-c C-g
This crashes Emacs here with
Thread 1 "emacs" received signal SIGSEGV, Segmentation fault.
0x000000000068810a in skip_noops (p=<error reading variable: Cannot access memory at address 0x7fffff66fff8>, pend=<error reading variable: Cannot access memory at address 0x7fffff66fff0>) at ../../src/regex-emacs.c:3556
3556 {
and an infinite backtrace starting with
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8:
#0 0x000000000068810a in skip_noops (p=#1 0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665
#2 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
#3 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839
#4 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839
#5 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
...
The same scenario worked well with Emacs 22 through 28.
martin
In GNU Emacs 29.1.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.5,
cairo version 1.16.0) of 2023-09-03 built on restno
Repository revision: f1e4cbe72aa4da9351cbbcd209d9233c68dd9fbb
Repository branch: emacs-29
Windowing system distributor 'The X.Org Foundation', version 11.0.12004000
System Description: Debian GNU/Linux 10 (buster)
Configured using:
'configure --with-gif=ifavailable --with-tiff=ifavailable
--with-gnutls=no --without-pop --enable-gcc-warnings=warn-only
--enable-checking=yes,glyphs --enable-check-lisp-object-type=yes
'CFLAGS=-O0 -g3 -no-pie -Wno-missing-braces''
Configured features:
CAIRO DBUS FREETYPE GIF GLIB GSETTINGS HARFBUZZ JPEG LIBSELINUX MODULES
NOTIFY INOTIFY PDUMPER PNG SECCOMP SOUND THREADS TOOLKIT_SCROLL_BARS X11
XDBE XIM XINPUT2 XPM GTK3 ZLIB
Important settings:
value of $LANG: de_AT.utf8
value of $XMODIFIERS: @im=ibus
locale-coding-system: utf-8-unix
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
eldoc-mode: t
show-paren-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
line-number-mode: t
indent-tabs-mode: t
transient-mark-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
Load-path shadows:
None found.
Features:
(shadow sort mail-extr emacsbug message mailcap yank-media puny dired
dired-loaddefs rfc822 mml mml-sec password-cache epa derived epg rfc6068
epg-config gnus-util text-property-search time-date subr-x mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
cl-loaddefs cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils elinfo texinfo texinfo-loaddefs info rmc
iso-transl tooltip cconv eldoc paren electric uniquify ediff-hook
vc-hooks lisp-float-type elisp-mode mwheel term/x-win x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode lisp-mode prog-mode register
page tab-bar menu-bar rfn-eshadow isearch easymenu timer select
scroll-bar mouse jit-lock font-lock syntax font-core term/tty-colors
frame minibuffer nadvice seq simple cl-generic indonesian philippine
cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese composite emoji-zwj charscript
charprop case-table epa-hook jka-cmpr-hook help abbrev obarray oclosure
cl-preloaded button loaddefs theme-loaddefs faces cus-face macroexp
files window text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget keymap hashtable-print-readable backquote
threads dbusbind inotify dynamic-setting system-font-setting
font-render-setting cairo move-toolbar gtk x-toolkit xinput2 x multi-tty
make-network-process emacs)
Memory information:
((conses 16 45758 8352)
(symbols 48 5682 0)
(strings 32 15622 2110)
(string-bytes 1 431024)
(vectors 16 10081)
(vector-slots 8 157669 13073)
(floats 8 26 23)
(intervals 56 218 0)
(buffers 976 10))
[elinfo.el (text/x-emacs-lisp, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 08:45:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 65726 <at> debbugs.gnu.org (full text, mbox):
Can't reproduce here (macOS). Can you give us a deeper backtrace?
Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 12:14:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Cc: martin rudalics <rudalics <at> gmx.at>
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Mon, 4 Sep 2023 10:44:06 +0200
>
> Can't reproduce here (macOS). Can you give us a deeper backtrace?
I can reproduce on MS-Windows. It's an infinite recursion.
> Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against?
Here's the end of the call stack:
#104528 0x0120bbc4 in search_buffer_re (string=XIL(0x8000000007e5e308),
pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1,
trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4),
posix=false) at search.c:1265
#104529 0x0120cae5 in search_buffer (string=XIL(0x8000000007e5e308),
pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1, RE=1,
trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4),
posix=false) at search.c:1527
#104530 0x0120b404 in search_command (string=XIL(0x8000000007e5e308),
bound=XIL(0), noerror=XIL(0x30), count=XIL(0), direction=1, RE=1,
posix=false) at search.c:1069
#104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294
#104532 0x0127860e in eval_sub (form=XIL(0xc000000007dbef40)) at eval.c:2508
#104533 0x0126f601 in Fand (args=XIL(0xc000000007dbe810)) at eval.c:370
#104534 0x012780e0 in eval_sub (form=XIL(0xc000000007dbef30)) at eval.c:2449
#104535 0x0126f847 in Fcond (args=XIL(0xc000000007e16d00)) at eval.c:412
#104536 0x012780e0 in eval_sub (form=XIL(0xc000000007e16d70)) at eval.c:2449
#104537 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
#104538 0x01272b16 in Flet (args=XIL(0xc000000007e16d90)) at eval.c:1026
#104539 0x012780e0 in eval_sub (form=XIL(0xc000000007e16da0)) at eval.c:2449
#104540 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
#104541 0x012780e0 in eval_sub (form=XIL(0xc000000007e17740)) at eval.c:2449
#104542 0x0126f74a in Fif (args=XIL(0xc000000007e16dd0)) at eval.c:391
#104543 0x012780e0 in eval_sub (form=XIL(0xc000000007e16dc0)) at eval.c:2449
#104544 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
#104545 0x01272b16 in Flet (args=XIL(0xc000000007e16e00)) at eval.c:1026
#104546 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e10)) at eval.c:2449
#104547 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
#104548 0x0126f7ae in Fif (args=XIL(0xc000000007e16e30)) at eval.c:392
#104549 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e20)) at eval.c:2449
#104550 0x0126fa9d in Fprogn (body=XIL(0xc000000007dbeb50)) at eval.c:436
#104551 0x0127247c in FletX (args=XIL(0xc000000007e16fc0)) at eval.c:958
#104552 0x012780e0 in eval_sub (form=XIL(0xc000000007e16fd0)) at eval.c:2449
#104553 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
#104554 0x0127b95e in funcall_lambda (fun=XIL(0xc000000007dbe750), nargs=0,
arg_vector=0x82ecb0) at eval.c:3233
#104555 0x0127a4c9 in funcall_general (fun=XIL(0xc000000007dbe750),
numargs=0, args=0x82ecb0) at eval.c:2957
#104556 0x0127a6b7 in Ffuncall (nargs=1, args=0x82eca8) at eval.c:2995
#104557 0x012693e3 in Ffuncall_interactively (nargs=1, args=0x82eca8)
at callint.c:250
#104558 0x0127ad17 in funcall_subr (subr=0x1881580 <Sfuncall_interactively>,
numargs=1, args=0x82eca8) at eval.c:3059
#104559 0x0127a2cb in funcall_general (fun=XIL(0xa000000001881580),
numargs=1, args=0x82eca8) at eval.c:2941
#104560 0x0127a6b7 in Ffuncall (nargs=2, args=0x82eca0) at eval.c:2995
#104561 0x012790cb in Fapply (nargs=3, args=0x82eca0) at eval.c:2619
#104562 0x01269ae9 in Fcall_interactively (function=XIL(0x60de7c0),
record_flag=XIL(0), keys=XIL(0xa000000007970198)) at callint.c:342
#104563 0x0127a915 in funcall_subr (subr=0x18815c0 <Scall_interactively>,
numargs=3, args=0x6c50078) at eval.c:3038
#104564 0x012ed316 in exec_byte_code (fun=XIL(0xa00000000615797c),
args_template=1025, nargs=1, args=0x82f590) at bytecode.c:809
#104565 0x0127ae66 in fetch_and_exec_byte_code (fun=XIL(0xa00000000615797c),
args_template=1025, nargs=1, args=0x82f588) at eval.c:3081
#104566 0x0127b3c5 in funcall_lambda (fun=XIL(0xa00000000615797c), nargs=1,
arg_vector=0x82f588) at eval.c:3153
#104567 0x0127a332 in funcall_general (fun=XIL(0xa00000000615797c),
numargs=1, args=0x82f588) at eval.c:2945
#104568 0x0127a6b7 in Ffuncall (nargs=2, args=0x82f580) at eval.c:2995
#104569 0x0116d992 in call1 (fn=XIL(0x4590), arg1=XIL(0x60de7c0))
at lisp.h:3248
#104570 0x01171cb7 in command_loop_1 () at keyboard.c:1503
#104571 0x01274470 in internal_condition_case (
bfun=0x11710ec <command_loop_1>, handlers=XIL(0x90),
hfun=0x11700ba <cmd_error>) at eval.c:1474
#104572 0x01170b59 in command_loop_2 (handlers=XIL(0x90)) at keyboard.c:1133
#104573 0x012732f7 in internal_catch (tag=XIL(0x103b0),
func=0x1170b22 <command_loop_2>, arg=XIL(0x90)) at eval.c:1197
#104574 0x01170ac4 in command_loop () at keyboard.c:1111
#104575 0x0116fb1a in recursive_edit_1 () at keyboard.c:720
#104576 0x0116fdb8 in Frecursive_edit () at keyboard.c:803
#104577 0x0116ab60 in main (argc=2, argv=0xa428e0) at emacs.c:2521
Lisp Backtrace:
"re-search-forward" (0x82d6a0)
"and" (0x82d8a0)
"cond" (0x82da60)
"let" (0x82dcf0)
"progn" (0x82deb0)
"if" (0x82e070)
"let" (0x82e300)
"if" (0x82e500)
"let*" (0x82e770)
"elinfo-goto-texi" (0x82ecb0)
"funcall-interactively" (0x82eca8)
"call-interactively" (0x6c50078)
"command-execute" (0x82f588)
(gdb)
In frame #104531, we have:
#104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294
and regexp is this monstrocity:
"\\(split\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)-\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)&optional\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)size\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)side\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)pixelwise\\)\\|^node"
Feel free to ask more questions, as I can reproduce this at will.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 13:19:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 65726 <at> debbugs.gnu.org (full text, mbox):
4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz <at> gnu.org>:
> I can reproduce on MS-Windows. It's an infinite recursion.
Hmmm, strange that I'm unable to then.
The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade?
I shall give this another go but meanwhile maybe Stefan could look at it, too?
Not that I am implything there is anything wrong with his code. Heavens, no.
> In frame #104531, we have:
>
> #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
> bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294
That should be the re-search-forward call in elinfo.el:973.
> and regexp is this monstrocity:
Thank you, this one is apparently generated dynamically.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 13:27:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Hmmm, strange that I'm unable to then.
Now I can reproduce the crash (I didn't have any TAGS files).
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 13:29:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Mon, 4 Sep 2023 15:18:01 +0200
> Cc: 65726 <at> debbugs.gnu.org,
> martin rudalics <rudalics <at> gmx.at>,
> Stefan Monnier <monnier <at> iro.umontreal.ca>
>
> 4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> > I can reproduce on MS-Windows. It's an infinite recursion.
>
> Hmmm, strange that I'm unable to then.
>
> The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade?
Yes, if I revert that change, the problem goes away.
> I shall give this another go but meanwhile maybe Stefan could look at it, too?
> Not that I am implything there is anything wrong with his code. Heavens, no.
Thanks.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 14:33:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8:
> #0 0x000000000068810a in skip_noops (p=#1 0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665
> #2 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
> #3 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839
> #4 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839
> #5 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
> ...
Hmm... the line numbers strongly suggests the inf-recursion happens via
the calls:
case on_failure_jump:
{
int mcnt;
p2++;
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
/* Don't just test `mcnt > 0` because non-greedy loops have
their test at the end with an unconditional jump at the start. */
if (p2 + mcnt > p2_orig) /* Ensure forward progress. */
return (mutually_exclusive_p (bufp, p1, p2)
&& mutually_exclusive_p (bufp, p1, p2 + mcnt));
break;
}
Re-reading the code I see that `skip_noops` can return a position
smaller than its argument, which makes it possible for `p2` to
be smaller (or equal) to `p2_orig` and hence explain that inf-loop
(that's the only path I can see that explains the inf-loop you're
seeing).
So, the patch below should hopefully fix your problem.
Stefan
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 7e75f0ac597..3a14c10771d 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3832,7 +3832,8 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
/* Don't just test `mcnt > 0` because non-greedy loops have
their test at the end with an unconditional jump at the start. */
- if (p2 + mcnt > p2_orig) /* Ensure forward progress. */
+ if (p2 + mcnt > p2_orig /* Ensure forward progress. */
+ && p2 > p2_orig) /* Bug#65726 */
return (mutually_exclusive_p (bufp, p1, p2)
&& mutually_exclusive_p (bufp, p1, p2 + mcnt));
break;
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 15:48:01 GMT)
Full text and
rfc822 format available.
Message #26 received at 65726 <at> debbugs.gnu.org (full text, mbox):
Minimised reproduction recipe:
(string-match (rx (* "a") (* (* "b"))) "a")
(Don't worry about the nested stars; the original was more like (* (| A (* B))).)
The generated bytecode is:
0 on-failure-jump-smart to 9
3 exact "a"
6 jump to 0
9 on-failure-jump-loop to 24
12 on-failure-jump-smart to 21
15 exact "b"
18 jump to 12
21 jump to 9
24 succeed
which suggests a mutual_exclusive_p recursion loop:
mep(3,9)
-> mep(3,12)
-> mep(3,21) = {skip_noop} = mep(3,9)
-> mep(3,12)
...
Maybe there is something clever we could do here but it's possible that we just can't avoid keeping track of visited nodes.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 15:59:01 GMT)
Full text and
rfc822 format available.
Message #29 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Cc: 65726 <at> debbugs.gnu.org
> Date: Mon, 04 Sep 2023 10:32:38 -0400
> From: Stefan Monnier via "Bug reports for GNU Emacs,
> the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>
> Re-reading the code I see that `skip_noops` can return a position
> smaller than its argument, which makes it possible for `p2` to
> be smaller (or equal) to `p2_orig` and hence explain that inf-loop
> (that's the only path I can see that explains the inf-loop you're
> seeing).
>
> So, the patch below should hopefully fix your problem.
Yes, with this patch the problem is gone.
Please install on the release branch, when you are satisfied with the
solution, and thanks.
Reply sent
to
Stefan Monnier <monnier <at> iro.umontreal.ca>
:
You have taken responsibility.
(Mon, 04 Sep 2023 17:13:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
martin rudalics <rudalics <at> gmx.at>
:
bug acknowledged by developer.
(Mon, 04 Sep 2023 17:13:02 GMT)
Full text and
rfc822 format available.
Message #34 received at 65726-done <at> debbugs.gnu.org (full text, mbox):
> Yes, with this patch the problem is gone.
Thanks, pushed to `emacs-29`.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 04 Sep 2023 17:13:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Tue, 05 Sep 2023 07:15:02 GMT)
Full text and
rfc822 format available.
Message #40 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> So, the patch below should hopefully fix your problem.
It fixes the problem.
Thanks, martin
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Tue, 05 Sep 2023 12:24:01 GMT)
Full text and
rfc822 format available.
Message #43 received at 65726 <at> debbugs.gnu.org (full text, mbox):
Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack overflow. It seems that by fixing this bug, we've unfixed the one that the broken code was supposed to fix.
And we should probably include a regression test for this bug as well. The previously mentioned
(string-match (rx (* "a") (* (* "b"))) "a")
might suffice, but extending it to
(string-match (rx (* "a") (* (or "c" (* "b")))) "a")
is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X).
Did not alter fixed versions and reopened.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Tue, 05 Sep 2023 12:35:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Tue, 05 Sep 2023 13:09:01 GMT)
Full text and
rfc822 format available.
Message #48 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack
> overflow.
Really? I thought I had specifically run that test.
[ ...recompiling + rerunning the test... ]
Hmm... you're right. Crap.
> It seems that by fixing this bug, we've unfixed the one that the
> broken code was supposed to fix.
Well, that one was less of a bug (more of a missing optimization), so
it's no reason to revert the change, but yes, not good.
Let's see if I can finagle something.
> And we should probably include a regression test for this bug as well.
> The previously mentioned
>
> (string-match (rx (* "a") (* (* "b"))) "a")
>
> might suffice, but extending it to
>
> (string-match (rx (* "a") (* (or "c" (* "b")))) "a")
>
> is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X).
Feel free. To me, it feels too much like testing the presence/absence
of a very specific bug (i.e. too unlikely that we'll reintroduce that
exact bug. It's not like it was a weird case I didn't think of).
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Tue, 05 Sep 2023 13:51:01 GMT)
Full text and
rfc822 format available.
Message #51 received at 65726 <at> debbugs.gnu.org (full text, mbox):
5 sep. 2023 kl. 15.08 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> Let's see if I can finagle something.
I liked your previous optimisation, so if you could finagle it back without breaking anything that would be pukka.
> Feel free. To me, it feels too much like testing the presence/absence
> of a very specific bug (i.e. too unlikely that we'll reintroduce that
> exact bug. It's not like it was a weird case I didn't think of).
Alright, pushed to emacs-29. It's a cheap test and offers at least some kind of protection in the event that someone undoes your handiwork. Such things do happen after all.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Tue, 05 Sep 2023 15:34:01 GMT)
Full text and
rfc822 format available.
Message #54 received at 65726 <at> debbugs.gnu.org (full text, mbox):
>> Feel free. To me, it feels too much like testing the presence/absence
>> of a very specific bug (i.e. too unlikely that we'll reintroduce that
>> exact bug. It's not like it was a weird case I didn't think of).
> Alright, pushed to emacs-29. It's a cheap test and offers at least some kind
> of protection in the event that someone undoes your handiwork. Such things
> do happen after all.
I marked the backtracking test as failing because the "real" fix wil
likely be too risky to put onto `emacs-29` (at least for now).
>> Let's see if I can finagle something.
> I liked your previous optimisation, so if you could finagle it back without
> breaking anything that would be pukka.
WDYT of the approach below?
You can ignore the `mutually_exclusive_(charset|exactn)` thingies which
just move code around (it's necessary for the patch).
It's a bit ugly because the actual assumptions on which it relies aren't
spelled out explicitly (and even less assert-checked at run time).
The basic idea is that `done` points to the beginning of the "current
loop": as long as we move forward we presume we may still be in the
current loop, and when we find a jump backward it's either a jump to the
beginning of a new (nested/further) loop (so we move `done`) or a jump
to the same loop or to some earlier loop (which we presumably handle
elsewhere).
I also tried a safer option where we do:
return ((p2 < done ? false
: p2 == done ? true
: mutually_exclusive_aux (bufp, p1, p2,
p2 > p2_orig ? done : p2))
&& (p2_other < done ? false
: p2_other == done ? true
: mutually_exclusive_aux (bufp, p1, p2_other,
p2_other > p2_orig
? done : p2_other)));
i.e. we only consider `done` itself as "already checked", which should
definitely be safe. It also seems to be sufficient to handle things
like:
(should (looking-at "x*\\(=+?\\|h\\)+"))
Adding an `assert (p2 >= done)` confirms that this does happen, so
whether we return true or false when `p2 < done` does matter, so I guess
we should go with the safer option unless we can really convince
ourselves that the more optimistic option is also correct.
Then again, maybe we should bite the bullet and actually keep track of
the positions already visited, so we don't need any "fancy" argument.
Stefan
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 7e75f0ac597..0d3cf39d619 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
return not;
}
+/* Case where `p2` points to an `exactn`. */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ bool multibyte = RE_MULTIBYTE_P (bufp);
+ int c
+ = (re_opcode_t) *p2 == endline ? '\n'
+ : RE_STRING_CHAR (p2 + 2, multibyte);
+
+ if ((re_opcode_t) *p1 == exactn)
+ {
+ if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+ {
+ DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
+ return true;
+ }
+ }
+
+ else if ((re_opcode_t) *p1 == charset
+ || (re_opcode_t) *p1 == charset_not)
+ {
+ if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+ Qnil))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == anychar
+ && c == '\n')
+ {
+ DEBUG_PRINT (" . != \\n => fast loop.\n");
+ return true;
+ }
+ return false;
+}
+
+/* Case where `p2` points to an `charset`. */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ /* It is hard to list up all the character in charset
+ P2 if it includes multibyte character. Give up in
+ such case. */
+ if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+ {
+ /* Now, we are sure that P2 has no range table.
+ So, for the size of bitmap in P2, 'p2[1]' is
+ enough. But P1 may have range table, so the
+ size of bitmap table of P1 is extracted by
+ using macro 'CHARSET_BITMAP_SIZE'.
+
+ In a multibyte case, we know that all the character
+ listed in P2 is ASCII. In a unibyte case, P1 has only a
+ bitmap table. So, in both cases, it is enough to test
+ only the bitmap table of P1. */
+
+ if ((re_opcode_t) *p1 == charset)
+ {
+ int idx;
+ /* We win if the charset inside the loop
+ has no overlap with the one after the loop. */
+ for (idx = 0;
+ (idx < (int) p2[1]
+ && idx < CHARSET_BITMAP_SIZE (p1));
+ idx++)
+ if ((p2[2 + idx] & p1[2 + idx]) != 0)
+ break;
+
+ if (idx == p2[1]
+ || idx == CHARSET_BITMAP_SIZE (p1))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == charset_not)
+ {
+ int idx;
+ /* We win if the charset_not inside the loop lists
+ every character listed in the charset after. */
+ for (idx = 0; idx < (int) p2[1]; idx++)
+ if (! (p2[2 + idx] == 0
+ || (idx < CHARSET_BITMAP_SIZE (p1)
+ && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+ break;
+
+ if (idx == p2[1])
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
/* True if "p1 matches something" implies "p2 fails". */
+/* Avoiding inf-loops:
+ We're trying to follow all paths reachable from `p2`, but since some
+ loops can match the empty string, this can loop back to `p2`.
+ To avoid inf-looping, we rely on a lexicographic ordering using `done`
+ and `p2`: at every recursion, either `done` is larger, or it
+ is unchanged and `p2` is larger. */
static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
- re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2, re_char *done)
{
re_opcode_t op2;
- bool multibyte = RE_MULTIBYTE_P (bufp);
unsigned char *pend = bufp->buffer + bufp->used;
re_char *p2_orig = p2;
eassert (p1 >= bufp->buffer && p1 < pend
&& p2 >= bufp->buffer && p2 <= pend);
+ eassert (p2 >= done);
+
/* Skip over open/close-group commands.
If what follows this loop is a ...+ construct,
look at what begins its body, since we will have to
@@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
case endline:
case exactn:
- {
- int c
- = (re_opcode_t) *p2 == endline ? '\n'
- : RE_STRING_CHAR (p2 + 2, multibyte);
-
- if ((re_opcode_t) *p1 == exactn)
- {
- if (c != RE_STRING_CHAR (p1 + 2, multibyte))
- {
- DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
- return true;
- }
- }
-
- else if ((re_opcode_t) *p1 == charset
- || (re_opcode_t) *p1 == charset_not)
- {
- if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
- Qnil))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == anychar
- && c == '\n')
- {
- DEBUG_PRINT (" . != \\n => fast loop.\n");
- return true;
- }
- }
- break;
+ return mutually_exclusive_exactn (bufp, p1, p2);
case charset:
{
if ((re_opcode_t) *p1 == exactn)
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
-
- /* It is hard to list up all the character in charset
- P2 if it includes multibyte character. Give up in
- such case. */
- else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
- {
- /* Now, we are sure that P2 has no range table.
- So, for the size of bitmap in P2, 'p2[1]' is
- enough. But P1 may have range table, so the
- size of bitmap table of P1 is extracted by
- using macro 'CHARSET_BITMAP_SIZE'.
-
- In a multibyte case, we know that all the character
- listed in P2 is ASCII. In a unibyte case, P1 has only a
- bitmap table. So, in both cases, it is enough to test
- only the bitmap table of P1. */
-
- if ((re_opcode_t) *p1 == charset)
- {
- int idx;
- /* We win if the charset inside the loop
- has no overlap with the one after the loop. */
- for (idx = 0;
- (idx < (int) p2[1]
- && idx < CHARSET_BITMAP_SIZE (p1));
- idx++)
- if ((p2[2 + idx] & p1[2 + idx]) != 0)
- break;
-
- if (idx == p2[1]
- || idx == CHARSET_BITMAP_SIZE (p1))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == charset_not)
- {
- int idx;
- /* We win if the charset_not inside the loop lists
- every character listed in the charset after. */
- for (idx = 0; idx < (int) p2[1]; idx++)
- if (! (p2[2 + idx] == 0
- || (idx < CHARSET_BITMAP_SIZE (p1)
- && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
- break;
-
- if (idx == p2[1])
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- }
+ return mutually_exclusive_exactn (bufp, p2, p1);
+ else
+ return mutually_exclusive_charset (bufp, p1, p2);
}
break;
@@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
switch (*p1)
{
case exactn:
+ return mutually_exclusive_exactn (bufp, p2, p1);
case charset:
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
+ return mutually_exclusive_charset (bufp, p2, p1);
case charset_not:
/* When we have two charset_not, it's very unlikely that
they don't overlap. The union of the two sets of excluded
@@ -3830,11 +3852,18 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
int mcnt;
p2++;
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
- /* Don't just test `mcnt > 0` because non-greedy loops have
- their test at the end with an unconditional jump at the start. */
- if (p2 + mcnt > p2_orig) /* Ensure forward progress. */
- return (mutually_exclusive_p (bufp, p1, p2)
- && mutually_exclusive_p (bufp, p1, p2 + mcnt));
+ re_char *p2_other = p2 + mcnt;
+ /* Presumably, any position up to `done` should be a position we
+ already considered elsewhere (because our state machines are not
+ arbitrary: they consists of syntaxically nested loops with limited
+ control flow), so we can consider those back jumps as mutually
+ exclusive here. */
+ return ((p2 <= done
+ || mutually_exclusive_aux (bufp, p1, p2,
+ p2 > p2_orig ? done : p2))
+ && (p2_other <= done
+ || mutually_exclusive_aux (bufp, p1, p2_other,
+ p2_other > p2_orig ? done : p2_other)));
break;
}
@@ -3846,6 +3875,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
return false;
}
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1));
+}
/* Matching routines. */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Wed, 06 Sep 2023 12:04:02 GMT)
Full text and
rfc822 format available.
Message #57 received at 65726 <at> debbugs.gnu.org (full text, mbox):
5 sep. 2023 kl. 17.33 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> You can ignore the `mutually_exclusive_(charset|exactn)` thingies which
> just move code around (it's necessary for the patch).
Actually that's a good refactoring to start with regardless of the path we follow later. Makes things more readable if nothing else.
> The basic idea is that `done` points to the beginning of the "current
> loop": as long as we move forward we presume we may still be in the
> current loop, and when we find a jump backward it's either a jump to the
> beginning of a new (nested/further) loop (so we move `done`) or a jump
> to the same loop or to some earlier loop (which we presumably handle
> elsewhere).
So how can we describe the exact semantics of `done` in mutually_exclusive_aux(p1,p2,done)?
Something like:
Let me(p1,p2) = true iff (p1 matches) implies (p2 fails).
Then me_aux(p1,p2,done) = me(p1,p2)
under the assumption that me(p1,x) holds for all x<done where x is reachable from p2.
?
> Adding an `assert (p2 >= done)` confirms that this does happen,
Can you give an example of such a regexp? I ran `make check` with your patch applied and checking enabled but got no assertion failure.
> so
> whether we return true or false when `p2 < done` does matter, so I guess
> we should go with the safer option unless we can really convince
> ourselves that the more optimistic option is also correct.
>
> Then again, maybe we should bite the bullet and actually keep track of
> the positions already visited, so we don't need any "fancy" argument.
There are a couple of ways of doing that but it might not be that bad. It would also be more robust in face of future changes that may not obey the current invariants.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 09 Sep 2023 15:56:02 GMT)
Full text and
rfc822 format available.
Message #60 received at 65726 <at> debbugs.gnu.org (full text, mbox):
>> Adding an `assert (p2 >= done)` confirms that this does happen,
>
> Can you give an example of such a regexp? I ran `make check` with your patch
> applied and checking enabled but got no assertion failure.
My patch didn't include the assertion :-)
The patch below does and hits an assertion failure during the dump:
Loading format...
Loading bindings...
WOW1!
Error: error ("WOW1!")
mapbacktrace(#[1028 "\1\4\203\24\0\301\302!\210\300\4!\210\301\303!\210\202\35\0\301\304!\210\3\3B\262\1\211\2035\0\300\1@!\210\211A\211\262\2\2035\0\301\305!\210\202!\0\301\306!\207" [prin1 princ " " "(" " (" " " ")\n"] 7 "\n\n(fn EVALD FUNC ARGS FLAGS)"])
debug-early-backtrace()
debug-early(error (error "WOW1!"))
string-match("\\`[^ ]+\\( [^ ]+\\)*\\'" "h r" nil t)
key-valid-p("h r")
keymap--check("h r")
keymap-set((keymap (27 keymap (119 . eww-search-words)) (111 . occur)) "h r" highlight-regexp)
define-keymap("o" occur "M-w" eww-search-words "h r" highlight-regexp "h p" highlight-phrase "h l" highlight-lines-matching-regexp "h ." highlight-symbol-at-point "h u" unhighlight-regexp "h f" hi-lock-find-patterns "h w" hi-lock-write-interactive-patterns)
(defvar search-map (define-keymap "o" 'occur "M-w" 'eww-search-words "h r" 'highlight-regexp "h p" 'highlight-phrase "h l" 'highlight-lines-matching-regexp "h ." 'highlight-symbol-at-point "h u" 'unhighlight-regexp "h f" 'hi-lock-find-patterns "h w" 'hi-lock-write-interactive-patterns) ("/home/monnier/src/emacs/main/lisp/bindings.elc" . 40677))
load("bindings")
load("loadup.el")
-- Stefan
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 394ba22e9b0..b47cdfbfef8 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
return not;
}
+/* Case where `p2` points to an `exactn`. */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ bool multibyte = RE_MULTIBYTE_P (bufp);
+ int c
+ = (re_opcode_t) *p2 == endline ? '\n'
+ : RE_STRING_CHAR (p2 + 2, multibyte);
+
+ if ((re_opcode_t) *p1 == exactn)
+ {
+ if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+ {
+ DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
+ return true;
+ }
+ }
+
+ else if ((re_opcode_t) *p1 == charset
+ || (re_opcode_t) *p1 == charset_not)
+ {
+ if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+ Qnil))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == anychar
+ && c == '\n')
+ {
+ DEBUG_PRINT (" . != \\n => fast loop.\n");
+ return true;
+ }
+ return false;
+}
+
+/* Case where `p2` points to an `charset`. */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ /* It is hard to list up all the character in charset
+ P2 if it includes multibyte character. Give up in
+ such case. */
+ if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+ {
+ /* Now, we are sure that P2 has no range table.
+ So, for the size of bitmap in P2, 'p2[1]' is
+ enough. But P1 may have range table, so the
+ size of bitmap table of P1 is extracted by
+ using macro 'CHARSET_BITMAP_SIZE'.
+
+ In a multibyte case, we know that all the character
+ listed in P2 is ASCII. In a unibyte case, P1 has only a
+ bitmap table. So, in both cases, it is enough to test
+ only the bitmap table of P1. */
+
+ if ((re_opcode_t) *p1 == charset)
+ {
+ int idx;
+ /* We win if the charset inside the loop
+ has no overlap with the one after the loop. */
+ for (idx = 0;
+ (idx < (int) p2[1]
+ && idx < CHARSET_BITMAP_SIZE (p1));
+ idx++)
+ if ((p2[2 + idx] & p1[2 + idx]) != 0)
+ break;
+
+ if (idx == p2[1]
+ || idx == CHARSET_BITMAP_SIZE (p1))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == charset_not)
+ {
+ int idx;
+ /* We win if the charset_not inside the loop lists
+ every character listed in the charset after. */
+ for (idx = 0; idx < (int) p2[1]; idx++)
+ if (! (p2[2 + idx] == 0
+ || (idx < CHARSET_BITMAP_SIZE (p1)
+ && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+ break;
+
+ if (idx == p2[1])
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
/* True if "p1 matches something" implies "p2 fails". */
+/* Avoiding inf-loops:
+ We're trying to follow all paths reachable from `p2`, but since some
+ loops can match the empty string, this can loop back to `p2`.
+ To avoid inf-looping, we rely on a lexicographic ordering using `done`
+ and `p2`: at every recursion, either `done` is larger, or it
+ is unchanged and `p2` is larger. */
static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
- re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2, re_char *done)
{
re_opcode_t op2;
- bool multibyte = RE_MULTIBYTE_P (bufp);
unsigned char *pend = bufp->buffer + bufp->used;
re_char *p2_orig = p2;
eassert (p1 >= bufp->buffer && p1 < pend
&& p2 >= bufp->buffer && p2 <= pend);
+ eassert (p2 >= done);
+
/* Skip over open/close-group commands.
If what follows this loop is a ...+ construct,
look at what begins its body, since we will have to
@@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
case endline:
case exactn:
- {
- int c
- = (re_opcode_t) *p2 == endline ? '\n'
- : RE_STRING_CHAR (p2 + 2, multibyte);
-
- if ((re_opcode_t) *p1 == exactn)
- {
- if (c != RE_STRING_CHAR (p1 + 2, multibyte))
- {
- DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
- return true;
- }
- }
-
- else if ((re_opcode_t) *p1 == charset
- || (re_opcode_t) *p1 == charset_not)
- {
- if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
- Qnil))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == anychar
- && c == '\n')
- {
- DEBUG_PRINT (" . != \\n => fast loop.\n");
- return true;
- }
- }
- break;
+ return mutually_exclusive_exactn (bufp, p1, p2);
case charset:
{
if ((re_opcode_t) *p1 == exactn)
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
-
- /* It is hard to list up all the character in charset
- P2 if it includes multibyte character. Give up in
- such case. */
- else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
- {
- /* Now, we are sure that P2 has no range table.
- So, for the size of bitmap in P2, 'p2[1]' is
- enough. But P1 may have range table, so the
- size of bitmap table of P1 is extracted by
- using macro 'CHARSET_BITMAP_SIZE'.
-
- In a multibyte case, we know that all the character
- listed in P2 is ASCII. In a unibyte case, P1 has only a
- bitmap table. So, in both cases, it is enough to test
- only the bitmap table of P1. */
-
- if ((re_opcode_t) *p1 == charset)
- {
- int idx;
- /* We win if the charset inside the loop
- has no overlap with the one after the loop. */
- for (idx = 0;
- (idx < (int) p2[1]
- && idx < CHARSET_BITMAP_SIZE (p1));
- idx++)
- if ((p2[2 + idx] & p1[2 + idx]) != 0)
- break;
-
- if (idx == p2[1]
- || idx == CHARSET_BITMAP_SIZE (p1))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == charset_not)
- {
- int idx;
- /* We win if the charset_not inside the loop lists
- every character listed in the charset after. */
- for (idx = 0; idx < (int) p2[1]; idx++)
- if (! (p2[2 + idx] == 0
- || (idx < CHARSET_BITMAP_SIZE (p1)
- && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
- break;
-
- if (idx == p2[1])
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- }
+ return mutually_exclusive_exactn (bufp, p2, p1);
+ else
+ return mutually_exclusive_charset (bufp, p1, p2);
}
break;
@@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
switch (*p1)
{
case exactn:
+ return mutually_exclusive_exactn (bufp, p2, p1);
case charset:
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
+ return mutually_exclusive_charset (bufp, p2, p1);
case charset_not:
/* When we have two charset_not, it's very unlikely that
they don't overlap. The union of the two sets of excluded
@@ -3830,11 +3852,25 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
int mcnt;
p2++;
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
- /* Don't just test `mcnt > 0` because non-greedy loops have
- their test at the end with an unconditional jump at the start. */
- if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress. */
- return (mutually_exclusive_p (bufp, p1, p2)
- && mutually_exclusive_p (bufp, p1, p2 + mcnt));
+ re_char *p2_other = p2 + mcnt;
+ /* Presumably, any position up to `done` should be a position we
+ already considered elsewhere (because our state machines are not
+ arbitrary: they consists of syntaxically nested loops with limited
+ control flow), so we can consider those back jumps as mutually
+ exclusive here. */
+ if (p2 < done)
+ error ("WOW1!");
+ if (p2_other < done)
+ error ("WOW2!");
+ return ((p2 < done ? true
+ : p2 == done ? true
+ : mutually_exclusive_aux (bufp, p1, p2,
+ p2 > p2_orig ? done : p2))
+ && (p2_other < done ? true
+ : p2_other == done ? true
+ : mutually_exclusive_aux (bufp, p1, p2_other,
+ p2_other > p2_orig
+ ? done : p2_other)));
break;
}
@@ -3846,6 +3882,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
return false;
}
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1));
+}
/* Matching routines. */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 09 Sep 2023 16:35:03 GMT)
Full text and
rfc822 format available.
Message #63 received at 65726 <at> debbugs.gnu.org (full text, mbox):
9 sep. 2023 kl. 17.55 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> My patch didn't include the assertion :-)
Oh, I thought you meant the eassert (p2 >= done); added at the start of mutually_exclusive_aux. Sorry about the confusion.
Anyway, disassembling the regexp makes it clear why:
Disassembly of regexp "\\`[^ ]+\\( [^ ]+\\)*\\'"
0 begbuf
1 charset-not [ ]
8 on-failure-jump-smart to 21
11 charset-not [ ]
18 jump to 8
21 on-failure-jump to 54
24 start-memory group 1
26 exact " "
29 charset-not [ ]
36 on-failure-jump-smart to 49
39 charset-not [ ]
46 jump to 36
49 stop-memory group 1
51 jump to 21
54 endbuf
55 succeed
which unless I'm mistaken we can condense to:
Disassembly of regexp "\\(?:ba*\\)*"
0 on-failure-jump to 18
3 exact "b"
6 on-failure-jump-smart to 15
9 exact "a"
12 jump to 6
15 jump to 0
18 succeed
so yes, we may need to remember where we've been. (At this point someone will inevitably point out a helpful invariant that is obvious in hindsight. This is just my cunning attempt at making that happen.)
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sun, 10 Sep 2023 07:51:02 GMT)
Full text and
rfc822 format available.
Message #66 received at 65726 <at> debbugs.gnu.org (full text, mbox):
Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
>> Yes, with this patch the problem is gone.
>
> Thanks, pushed to `emacs-29`.
Is there anything more to do here, or can this bug report be closed?
Reply sent
to
Stefan Kangas <stefankangas <at> gmail.com>
:
You have taken responsibility.
(Sun, 10 Sep 2023 07:51:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
martin rudalics <rudalics <at> gmx.at>
:
bug acknowledged by developer.
(Sun, 10 Sep 2023 07:51:02 GMT)
Full text and
rfc822 format available.
bug Marked as fixed in versions 30.1.
Request was from
Stefan Kangas <stefankangas <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Sun, 10 Sep 2023 07:52:01 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sun, 10 Sep 2023 07:56:02 GMT)
Full text and
rfc822 format available.
Message #76 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Sun, 10 Sep 2023 00:50:48 -0700
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 65726 <at> debbugs.gnu.org, rudalics <at> gmx.at,
> 65726-done <at> debbugs.gnu.org
>
> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
>
> >> Yes, with this patch the problem is gone.
> >
> > Thanks, pushed to `emacs-29`.
>
> Is there anything more to do here, or can this bug report be closed?
I believe this is still being worked on and discussed by Stefan and
Mattias.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sun, 10 Sep 2023 23:10:02 GMT)
Full text and
rfc822 format available.
Message #79 received at 65726 <at> debbugs.gnu.org (full text, mbox):
>> >> Yes, with this patch the problem is gone.
>> > Thanks, pushed to `emacs-29`.
>> Is there anything more to do here, or can this bug report be closed?
> I believe this is still being worked on and discussed by Stefan and
> Mattias.
I think we can close it. What is being discussed is how to recover the
optimization that had to be significantly weakened to fix this patch,
but it was a new optimization in Emacs-29, so it's not urgent nor
a bug fix.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 11 Sep 2023 08:12:02 GMT)
Full text and
rfc822 format available.
Message #82 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> I think we can close it. What is being discussed is how to recover the optimization that had to be significantly weakened to fix this patch, but it was a new optimization in Emacs-29, so it's not urgent nor a bug fix.
Would you confirm that bug#61514 remains 'fixed' for practical purposes? I ran a few of the cases in that bug and it seems that we have them still covered, but I may have missed something.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 11 Sep 2023 13:42:02 GMT)
Full text and
rfc822 format available.
Message #85 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> Would you confirm that bug#61514 remains 'fixed' for practical purposes?
I believe it does: the real fix was to use narrowing.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 11 Sep 2023 14:47:02 GMT)
Full text and
rfc822 format available.
Message #88 received at 65726-done <at> debbugs.gnu.org (full text, mbox):
Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
> I think we can close it. What is being discussed is how to recover the
> optimization that had to be significantly weakened to fix this patch,
> but it was a new optimization in Emacs-29, so it's not urgent nor
> a bug fix.
Thanks, I'm therefore closing this bug report.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Thu, 14 Sep 2023 14:42:01 GMT)
Full text and
rfc822 format available.
Message #91 received at 65726 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
> so yes, we may need to remember where we've been. (At this point
> someone will inevitably point out a helpful invariant that is obvious
> in hindsight. This is just my cunning attempt at making that happen.)
No helpful invariant that makes it trivial, but if we keep pushing the
same idea that we rely on the assumption that we just have
a syntactically nested loop nest, then we can handle that as in the
patch below.
I.e. keep the idea I proposed of keeping track of a beg..end region
that's already been handled. But now we really do have to maintain both
ends (before, I only had `done` which kept track of the end of the
region), and just "restart" when we jump back to a point before the
"done" region.
Stefan
[regexp.c.patch (text/x-diff, inline)]
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 394ba22e9b0..a6b7faadf86 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,128 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
return not;
}
+/* Case where `p2` points to an `exactn`. */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ bool multibyte = RE_MULTIBYTE_P (bufp);
+ int c
+ = (re_opcode_t) *p2 == endline ? '\n'
+ : RE_STRING_CHAR (p2 + 2, multibyte);
+
+ if ((re_opcode_t) *p1 == exactn)
+ {
+ if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+ {
+ DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
+ return true;
+ }
+ }
+
+ else if ((re_opcode_t) *p1 == charset
+ || (re_opcode_t) *p1 == charset_not)
+ {
+ if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+ Qnil))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == anychar
+ && c == '\n')
+ {
+ DEBUG_PRINT (" . != \\n => fast loop.\n");
+ return true;
+ }
+ return false;
+}
+
+/* Case where `p2` points to an `charset`. */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ /* It is hard to list up all the character in charset
+ P2 if it includes multibyte character. Give up in
+ such case. */
+ if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+ {
+ /* Now, we are sure that P2 has no range table.
+ So, for the size of bitmap in P2, 'p2[1]' is
+ enough. But P1 may have range table, so the
+ size of bitmap table of P1 is extracted by
+ using macro 'CHARSET_BITMAP_SIZE'.
+
+ In a multibyte case, we know that all the character
+ listed in P2 is ASCII. In a unibyte case, P1 has only a
+ bitmap table. So, in both cases, it is enough to test
+ only the bitmap table of P1. */
+
+ if ((re_opcode_t) *p1 == charset)
+ {
+ int idx;
+ /* We win if the charset inside the loop
+ has no overlap with the one after the loop. */
+ for (idx = 0;
+ (idx < (int) p2[1]
+ && idx < CHARSET_BITMAP_SIZE (p1));
+ idx++)
+ if ((p2[2 + idx] & p1[2 + idx]) != 0)
+ break;
+
+ if (idx == p2[1]
+ || idx == CHARSET_BITMAP_SIZE (p1))
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ else if ((re_opcode_t) *p1 == charset_not)
+ {
+ int idx;
+ /* We win if the charset_not inside the loop lists
+ every character listed in the charset after. */
+ for (idx = 0; idx < (int) p2[1]; idx++)
+ if (! (p2[2 + idx] == 0
+ || (idx < CHARSET_BITMAP_SIZE (p1)
+ && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+ break;
+
+ if (idx == p2[1])
+ {
+ DEBUG_PRINT (" No match => fast loop.\n");
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
/* True if "p1 matches something" implies "p2 fails". */
+/* Avoiding inf-loops:
+ We're trying to follow all paths reachable from `p2`, but since some
+ loops can match the empty string, this can loop back to `p2`.
+ To avoid inf-looping, we keep track of points that have been considered
+ "already". Instead of keeping a list of such points, we assume
+ that the code is a "sequential loop nest" and only keep track of
+ `done_beg` and `done_end` which delimit a chunk of bytecode we consider
+ as already considered. */
static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
- re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2, re_char *done_beg, re_char *done_end)
{
re_opcode_t op2;
- bool multibyte = RE_MULTIBYTE_P (bufp);
unsigned char *pend = bufp->buffer + bufp->used;
re_char *p2_orig = p2;
eassert (p1 >= bufp->buffer && p1 < pend
&& p2 >= bufp->buffer && p2 <= pend);
+ eassert (done_beg <= done_end);
+ eassert (done_end <= p2);
+
/* Skip over open/close-group commands.
If what follows this loop is a ...+ construct,
look at what begins its body, since we will have to
@@ -3684,98 +3793,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
case endline:
case exactn:
- {
- int c
- = (re_opcode_t) *p2 == endline ? '\n'
- : RE_STRING_CHAR (p2 + 2, multibyte);
-
- if ((re_opcode_t) *p1 == exactn)
- {
- if (c != RE_STRING_CHAR (p1 + 2, multibyte))
- {
- DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]);
- return true;
- }
- }
-
- else if ((re_opcode_t) *p1 == charset
- || (re_opcode_t) *p1 == charset_not)
- {
- if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
- Qnil))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == anychar
- && c == '\n')
- {
- DEBUG_PRINT (" . != \\n => fast loop.\n");
- return true;
- }
- }
- break;
+ return mutually_exclusive_exactn (bufp, p1, p2);
case charset:
{
if ((re_opcode_t) *p1 == exactn)
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
-
- /* It is hard to list up all the character in charset
- P2 if it includes multibyte character. Give up in
- such case. */
- else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
- {
- /* Now, we are sure that P2 has no range table.
- So, for the size of bitmap in P2, 'p2[1]' is
- enough. But P1 may have range table, so the
- size of bitmap table of P1 is extracted by
- using macro 'CHARSET_BITMAP_SIZE'.
-
- In a multibyte case, we know that all the character
- listed in P2 is ASCII. In a unibyte case, P1 has only a
- bitmap table. So, in both cases, it is enough to test
- only the bitmap table of P1. */
-
- if ((re_opcode_t) *p1 == charset)
- {
- int idx;
- /* We win if the charset inside the loop
- has no overlap with the one after the loop. */
- for (idx = 0;
- (idx < (int) p2[1]
- && idx < CHARSET_BITMAP_SIZE (p1));
- idx++)
- if ((p2[2 + idx] & p1[2 + idx]) != 0)
- break;
-
- if (idx == p2[1]
- || idx == CHARSET_BITMAP_SIZE (p1))
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- else if ((re_opcode_t) *p1 == charset_not)
- {
- int idx;
- /* We win if the charset_not inside the loop lists
- every character listed in the charset after. */
- for (idx = 0; idx < (int) p2[1]; idx++)
- if (! (p2[2 + idx] == 0
- || (idx < CHARSET_BITMAP_SIZE (p1)
- && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
- break;
-
- if (idx == p2[1])
- {
- DEBUG_PRINT (" No match => fast loop.\n");
- return true;
- }
- }
- }
+ return mutually_exclusive_exactn (bufp, p2, p1);
+ else
+ return mutually_exclusive_charset (bufp, p1, p2);
}
break;
@@ -3783,9 +3808,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
switch (*p1)
{
case exactn:
+ return mutually_exclusive_exactn (bufp, p2, p1);
case charset:
- /* Reuse the code above. */
- return mutually_exclusive_p (bufp, p2, p1);
+ return mutually_exclusive_charset (bufp, p2, p1);
case charset_not:
/* When we have two charset_not, it's very unlikely that
they don't overlap. The union of the two sets of excluded
@@ -3830,12 +3855,30 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
int mcnt;
p2++;
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
- /* Don't just test `mcnt > 0` because non-greedy loops have
- their test at the end with an unconditional jump at the start. */
- if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress. */
- return (mutually_exclusive_p (bufp, p1, p2)
- && mutually_exclusive_p (bufp, p1, p2 + mcnt));
- break;
+ re_char *p2_other = p2 + mcnt;
+
+ /* Presumably, any position in `done_beg..end` should be a position we
+ already considered elsewhere (because our state machines are not
+ arbitrary: they consists of syntaxically nested loops with limited
+ control flow), so we can consider those back jumps as mutually
+ exclusive here.
+ This shows in the code below when `p3 > p2_orig`, i.e. when we jump
+ forward: in that case we bump `done_end` up to `p3` under
+ the assumption that any other reachable position
+ between `done_end` and `p3` will be checked by the *other*
+ call to RECURSE.
+ The recursion should terminate because of a lexical ordering between
+ `done_beg`, `done_end`, and `p2`:
+ At each recursion, either `done_beg` gets smaller,
+ or `done_beg` is unchanged and `done_end` gets larger
+ or they're both unchanged and `p2` gets larger. */
+#define RECURSE(p3) \
+ ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
+ : (p3) <= done_end ? true \
+ : mutually_exclusive_aux (bufp, p1, p3, done_beg, \
+ (p3) > p2_orig ? done_end : (p3)))
+
+ return RECURSE (p2) && RECURSE (p2_other);
}
default:
@@ -3846,6 +3889,13 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
return false;
}
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+ re_char *p2)
+{
+ re_char *done = min (p2, p1);
+ return mutually_exclusive_aux (bufp, p1, p2, done, done);
+}
/* Matching routines. */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Fri, 15 Sep 2023 20:05:02 GMT)
Full text and
rfc822 format available.
Message #94 received at 65726 <at> debbugs.gnu.org (full text, mbox):
14 sep. 2023 kl. 16.41 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> I.e. keep the idea I proposed of keeping track of a beg..end region
> that's already been handled. But now we really do have to maintain both
> ends (before, I only had `done` which kept track of the end of the
> region), and just "restart" when we jump back to a point before the
> "done" region.
Thank you, this looks workable and should enable some further optimisations. Another option would be to add the usual cycle detection, but this is probably faster.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Fri, 15 Sep 2023 22:21:01 GMT)
Full text and
rfc822 format available.
Message #97 received at 65726 <at> debbugs.gnu.org (full text, mbox):
>> I.e. keep the idea I proposed of keeping track of a beg..end region
>> that's already been handled. But now we really do have to maintain both
>> ends (before, I only had `done` which kept track of the end of the
>> region), and just "restart" when we jump back to a point before the
>> "done" region.
> Thank you, this looks workable and should enable some further
> optimisations.
I just pushed this to `master`.
> Another option would be to add the usual cycle detection,
That wouldn't be fun, tho, would it?
> but this is probably faster.
And more thrilling.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 16 Sep 2023 03:46:02 GMT)
Full text and
rfc822 format available.
Message #100 received at 65726 <at> debbugs.gnu.org (full text, mbox):
>> so yes, we may need to remember where we've been. (At this point
>> someone will inevitably point out a helpful invariant that is obvious
>> in hindsight. This is just my cunning attempt at making that happen.)
I think I have a "helpful invariant", finally:
Because of how we build our code, when we're at an `on_failure_jump` the
two branches can either both go forward (typically for a "|" or a "?"),
or *one* of them goes backward the other forward (for loops), where the
one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it
`p2_loop`) that goes back to the beginning of the loop and the other
(call it `p2_exit`) is the one that exits the loop.
Now, because our loops are nested with proper "structured programming",
there can't be any jump from within the loop to outside the loop except
for the current jump. And there can't be any jump from outside the loop
to inside the loop except by entering via `p2_loop`.
Since we have two recursive calls to `mutually_exclusive_p` (one for
`p2_exit` and one for `p2_loop`) and each one only needs to check those
positions not checked by the other, we can say that `p2_loop` only needs
to check the positions within the loop (i.e. between `p2_loop` and
`p2_exit`) and can presume that *all* other positions are checked by the
other recursive call (the one that starts at `p2_exit`).
So I think a single arg `done_end` (set, like the current `done_end`, to
`p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's
no way to go from `p2_loop` to before `p2_loop` without first going to
`p2_exit` (which is already checked by the other call).
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 16 Sep 2023 10:51:02 GMT)
Full text and
rfc822 format available.
Message #103 received at 65726 <at> debbugs.gnu.org (full text, mbox):
16 sep. 2023 kl. 05.45 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
>>> (At this point
>>> someone will inevitably point out a helpful invariant that is obvious
>>> in hindsight. This is just my cunning attempt at making that happen.)
>
> I think I have a "helpful invariant", finally:
Everything is proceeding as I have foreseen.
> Because of how we build our code, when we're at an `on_failure_jump` the
> two branches can either both go forward (typically for a "|" or a "?"),
> or *one* of them goes backward the other forward (for loops), where the
> one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it
> `p2_loop`) that goes back to the beginning of the loop and the other
> (call it `p2_exit`) is the one that exits the loop.
>
> Now, because our loops are nested with proper "structured programming",
> there can't be any jump from within the loop to outside the loop except
> for the current jump. And there can't be any jump from outside the loop
> to inside the loop except by entering via `p2_loop`.
>
> Since we have two recursive calls to `mutually_exclusive_p` (one for
> `p2_exit` and one for `p2_loop`) and each one only needs to check those
> positions not checked by the other, we can say that `p2_loop` only needs
> to check the positions within the loop (i.e. between `p2_loop` and
> `p2_exit`) and can presume that *all* other positions are checked by the
> other recursive call (the one that starts at `p2_exit`).
>
> So I think a single arg `done_end` (set, like the current `done_end`, to
> `p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's
> no way to go from `p2_loop` to before `p2_loop` without first going to
> `p2_exit` (which is already checked by the other call).
I think you are right, but wouldn't mind seeing it confirmed empirically. Say, by cross-checking using an an alternative (slower) implementation that directly checks whether a node has been visited before.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 16 Sep 2023 15:49:02 GMT)
Full text and
rfc822 format available.
Message #106 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> I think you are right, but wouldn't mind seeing it confirmed
> empirically. Say, by cross-checking using an an alternative (slower)
> implementation that directly checks whether a node has been visited before.
I'm reworking my code so it takes 2 arguments: the loop-entry and the
loop-exit. Any jump outside of those bounds should never happen so we
can assert it to check that our assumptions are right (and we can
return false when we don't compile assertions, so it's never over-optimistic).
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 18 Sep 2023 02:15:02 GMT)
Full text and
rfc822 format available.
Message #109 received at 65726 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
> I'm reworking my code so it takes 2 arguments: the loop-entry and the
> loop-exit. Any jump outside of those bounds should never happen so we
> can assert it to check that our assumptions are right (and we can
> return false when we don't compile assertions, so it's never over-optimistic).
Hmm... not quite working yet. With the patch below I get:
% make test/src/regex-emacs-tests
[...]
passed 30/34 regexp-multibyte-unibyte (0.000264 sec)
0: /on_failure_jump_smart to 9
3: /exactn/1/x
6: /jump to 0
9: /start_memory/1
11: /on_failure_jump to 26
14: /on_failure_jump_smart to 23
17: /exactn/1/=
20: /jump to 14
23: /jump to 29
26: /exactn/1/h
29: /stop_memory/1
31: /on_failure_jump_loop to 37
34: /jump to 9
37: /succeed
38: end of pattern.
38 bytes used/128 bytes allocated.
re_nsub: 1 regs_alloc: 1 can_be_null: 0
p1==3, p2==34, loop-entry==14, loop-exit==26
Test regexp-tests-backtrack-optimization backtrace:
looking-at("x*\\(=*\\|h\\)+")
[...]
The problem is that my code sees the `jump 9` (near the end) followed by
`start_memory` and `on_failure_jump to 26` as a a backward jump that
defines a loop whose body starts at 14 and whose exit point is at 26,
but that 14..26 is not a loop, it's a `|` and those don't obey the
assumption I made about the exit point (the 2 destinations of such an
`on_failure_jump` correspond to the first and the second patterns of the
`|`, i.e. entry&middle rather than entry&exit, so there *can* be
jumps straight out from the first point without going through the second
point 🙁).
BTW, here's an indented version of the code:
0: /on_failure_jump_smart to 9 {
3: /exactn/1/x
6: } /jump to 0
9: { /start_memory/1
11: /on_failure_jump to 26
14: /on_failure_jump_smart to 23 {
17: /exactn/1/=
20: } /jump to 14
23: /jump to 29
26: /exactn/1/h
29: /stop_memory/1
31: /on_failure_jump_loop to 37
34: } /jump to 9
37: /succeed
38: end of pattern.
Another problem we see here is that the main (second) loop spans 9..37
and is controlled by the `/on_failure_jump_loop to 37` but the two
destination of the OFJ are not 9 and 37 but 34 and 37 🙂.
So maybe we should `skip_noops` before looking at the destinations to
decide if it's a loop?
Stefan
[regexp.c.patch (text/x-diff, inline)]
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 55d0a6e8df8..c54f9f81890 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3068,8 +3068,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
continue;
case succeed_n:
- /* If N == 0, it should be an on_failure_jump_loop instead. */
- DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
+ /* If N == 0, it should be an on_failure_jump_loop instead.
+ `j` can be negative because `EXTRACT_NUMBER` extracts a
+ signed number whereas `succeed_n` treats it as unsigned. */
+ DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));
p += 4;
/* We only care about one iteration of the loop, so we don't
need to consider the case where this behaves like an
@@ -3743,20 +3745,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
}
/* True if "p1 matches something" implies "p2 fails". */
-/* Avoiding inf-loops:
- We're trying to follow all paths reachable from `p2`, but since some
+/* We're trying to follow all paths reachable from `p2`, but since some
loops can match the empty string, this can loop back to `p2`.
- To avoid inf-looping, we keep track of points that have been considered
- "already". Instead of keeping a list of such points, `done_beg` and
- `done_end` delimit a chunk of bytecode we already considered.
- To guarantee termination, a lexical ordering between `done_*` and `p2`
- should be obeyed:
- At each recursion, either `done_beg` gets smaller,
- or `done_beg` is unchanged and `done_end` gets larger
- or they're both unchanged and `p2` gets larger. */
+ To avoid inf-looping, we take advantage of the fact that
+ the bytecode we generate is made of syntactically nested loops, more
+ specifically, every loop has a single entry point and single exit point.
+ The function takes 2 more arguments (`loop_entry` and `loop_exit`).
+ `loop_entry` points to the sole entry point of the current loop and
+ `loop_exit` points to its sole exit point (when non-NULL).
+ The function can assume that `loop_exit` is "mutually exclusive".
+ Termination of recursive calls is ensured because either `loop_entry`
+ is larger, or it's equal but `p2` is larger. */
static bool
mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
- re_char *p2, re_char *done_beg, re_char *done_end)
+ re_char *p2, re_char *loop_entry, re_char *loop_exit)
{
re_opcode_t op2;
unsigned char *pend = bufp->buffer + bufp->used;
@@ -3765,8 +3767,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
eassert (p1 >= bufp->buffer && p1 < pend
&& p2 >= bufp->buffer && p2 <= pend);
- eassert (done_beg <= done_end);
- eassert (done_end <= p2);
+ if (p2 == loop_exit)
+ /* Presumably already checked elsewhere. */
+ return true;
+ eassert (loop_entry && p2 >= loop_entry);
+ /* eassert (!loop_exit || p2 < loop_exit); */
+ if (p2 < loop_entry)
+ return false;
+ if (loop_exit && p2 > loop_exit)
+ {
+ void print_compiled_pattern (struct re_pattern_buffer *bufp);
+ print_compiled_pattern (bufp);
+ fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
+ p1 - bufp->buffer, p2 - bufp->buffer,
+ loop_entry - bufp->buffer, loop_exit - bufp->buffer);
+ error ("WOW1!");
+ }
+ /* return false; */
/* Skip over open/close-group commands.
If what follows this loop is a ...+ construct,
@@ -3860,27 +3877,30 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
re_char *p2_other = p2 + mcnt;
- /* When we jump backward we bump `done_end` up to `p3` under
- the assumption that any other position between `done_end`
- and `p3` is either:
- - checked by the other call to RECURSE.
- - not reachable from here (e.g. for positions before the
- `on_failure_jump`), or at least not without first
- jumping before `done_beg`.
- This should hold because our state machines are not arbitrary:
- they consists of syntaxically nested loops with limited
- control flow.
- FIXME: This can fail (i.e. return true when it shouldn't)
- if we start generating bytecode with a different shape,
- so maybe we should bite the bullet and replace done_beg/end
- with an actual list of positions we've already processed. */
-#define RECURSE(p3) \
- ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
- : (p3) <= done_end ? true \
- : mutually_exclusive_aux (bufp, p1, p3, done_beg, \
- (p3) > p2_orig ? done_end : (p3)))
-
- return RECURSE (p2) && RECURSE (p2_other);
+ /* If we jump backward to the entry point of the current loop
+ it means it's a zero-length cycle through that loop, so
+ this cycle itself does not break mutual-exclusion.
+ If we jump backward to a new loop nested within the current one
+ then the `p2_other` points to the exit of that inner loop
+ and the other RECURSE will check what happens when we leave
+ the loop so we can focus on checking just the loop body,
+ so we pass new values for `loop-entry` and `loop_exit`.
+ In all other cases, we just recurse "normally": if it's before
+ `loop_entry` it's a bug that will be caught by checks at
+ the entrace of the function and otherwise it's a forward jump
+ which doesn't need any special treatment.
+ FIXME: This is failsafe (can't return true when it shouldn't)
+ but it could be too conservative if we start generating bytecode
+ with a different shape, so maybe we should bite the bullet and
+ replace done_beg/end with an actual list of positions we've
+ already processed. */
+#define RECURSE(p3, p3_other) \
+ ((p3) == loop_entry ? true \
+ : loop_entry < (p3) && (p3) < p2_orig ? \
+ mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \
+ : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit))
+
+ return RECURSE (p2, p2_other) && RECURSE (p2_other, p2);
}
default:
@@ -3895,7 +3915,7 @@ #define RECURSE(p3) \
mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
re_char *p2)
{
- return mutually_exclusive_aux (bufp, p1, p2, p2, p2);
+ return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL);
}
/* Matching routines. */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 18 Sep 2023 04:00:03 GMT)
Full text and
rfc822 format available.
Message #112 received at 65726 <at> debbugs.gnu.org (full text, mbox):
> So maybe we should `skip_noops` before looking at the destinations to
> decide if it's a loop?
Nah, just introduces other problems :-(
BTW maybe we should replace
31: /on_failure_jump_loop to 37
34: /jump to 9
with
31: /on_failure_dont_jump_loop to 9
It's quite funny actually: with the current bytecodes, every iteration
of a backtracking greedy loop runs `jump + on_failure_jump`, whereas for
a non-greedy loop every iteration runs just a single `on_failure_jump`.
Yet those bytecodes predate the introduction of non-greedy loops :-)
[ Of course, there are reasons for it, beside historical ones: it's harder
to rewrite a loop that ends with `on_failure_dont_jump_smart` into one
using `on_failure_keep_string_jump` :-( ]
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Mon, 18 Sep 2023 12:34:01 GMT)
Full text and
rfc822 format available.
Message #115 received at 65726 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
> BTW maybe we should replace
>
> 31: /on_failure_jump_loop to 37
> 34: /jump to 9
>
> with
>
> 31: /on_failure_dont_jump_loop to 9
In the mean time the patch below recognizes the above pattern, which
gives me code which is safe (it should both always terminate and should
never apply the optimization if it's not valid) yet does find the
optimization in all the cases where I expected it.
Stefan
[regexp.c.patch (text/x-diff, inline)]
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 55d0a6e8df8..e45d0f19dbf 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1923,12 +1923,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
if (!zero_times_ok && simple)
{ /* Since simple * loops can be made faster by using
- on_failure_keep_string_jump, we turn simple P+
- into PP* if P is simple. */
+ on_failure_keep_string_jump, we turn P+ into PP*
+ if P is simple.
+ We can't use `top: <BODY>; OFJS exit; J top; exit:`
+ Because the OFJS needs to be at the beginning
+ since we want to be able to replace
+ `top: OFJS exit; <BODY>; J top; exit`
+ with
+ `top: OFKSJ exit; loop: <BODY>; J loop; exit`,
+ i.e. a single OFJ at the beginning of the loop
+ rather than once per iteration. */
unsigned char *p1, *p2;
startoffset = b - laststart;
GET_BUFFER_SPACE (startoffset);
p1 = b; p2 = laststart;
+ /* This presumes that the code skipped
+ by `skip_one_char` is position-independent. */
while (p2 < p1)
*b++ = *p2++;
zero_times_ok = 1;
@@ -3068,8 +3078,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
continue;
case succeed_n:
- /* If N == 0, it should be an on_failure_jump_loop instead. */
- DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
+ /* If N == 0, it should be an on_failure_jump_loop instead.
+ `j` can be negative because `EXTRACT_NUMBER` extracts a
+ signed number whereas `succeed_n` treats it as unsigned. */
+ DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));
p += 4;
/* We only care about one iteration of the loop, so we don't
need to consider the case where this behaves like an
@@ -3743,20 +3755,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
}
/* True if "p1 matches something" implies "p2 fails". */
-/* Avoiding inf-loops:
- We're trying to follow all paths reachable from `p2`, but since some
+/* We're trying to follow all paths reachable from `p2`, but since some
loops can match the empty string, this can loop back to `p2`.
- To avoid inf-looping, we keep track of points that have been considered
- "already". Instead of keeping a list of such points, `done_beg` and
- `done_end` delimit a chunk of bytecode we already considered.
- To guarantee termination, a lexical ordering between `done_*` and `p2`
- should be obeyed:
- At each recursion, either `done_beg` gets smaller,
- or `done_beg` is unchanged and `done_end` gets larger
- or they're both unchanged and `p2` gets larger. */
+ To avoid inf-looping, we take advantage of the fact that
+ the bytecode we generate is made of syntactically nested loops, more
+ specifically, every loop has a single entry point and single exit point.
+ The function takes 2 more arguments (`loop_entry` and `loop_exit`).
+ `loop_entry` points to the sole entry point of the current loop and
+ `loop_exit` points to its sole exit point (when non-NULL).
+ The function can assume that `loop_exit` is "mutually exclusive".
+ Termination of recursive calls is ensured because either `loop_entry`
+ is larger, or it's equal but `p2` is larger. */
static bool
mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
- re_char *p2, re_char *done_beg, re_char *done_end)
+ re_char *p2, re_char *loop_entry, re_char *loop_exit)
{
re_opcode_t op2;
unsigned char *pend = bufp->buffer + bufp->used;
@@ -3765,8 +3777,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
eassert (p1 >= bufp->buffer && p1 < pend
&& p2 >= bufp->buffer && p2 <= pend);
- eassert (done_beg <= done_end);
- eassert (done_end <= p2);
+ if (p2 == loop_exit)
+ /* Presumably already checked elsewhere. */
+ return true;
+ eassert (loop_entry && p2 >= loop_entry);
+ /* eassert (!loop_exit || p2 < loop_exit); */
+ if (p2 < loop_entry)
+ return false;
+ if (loop_exit && p2 > loop_exit)
+ {
+ void print_compiled_pattern (struct re_pattern_buffer *bufp);
+ print_compiled_pattern (bufp);
+ fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
+ p1 - bufp->buffer, p2 - bufp->buffer,
+ loop_entry - bufp->buffer, loop_exit - bufp->buffer);
+ error ("WOW1!");
+ }
+ /* return false; */
/* Skip over open/close-group commands.
If what follows this loop is a ...+ construct,
@@ -3859,28 +3886,40 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
p2++;
EXTRACT_NUMBER_AND_INCR (mcnt, p2);
re_char *p2_other = p2 + mcnt;
+ /* For `+` loops, we often have an `on_failure_jump` that skips forward
+ over a subsequent `jump` for lack of an `on_failure_dont_jump`
+ kind of thing. Recognize this pattern since that subsequent
+ `jump` is the one that jumps to the loop-entry. */
+ if ((re_opcode_t) p2[0] == jump && mcnt == 3)
+ {
+ EXTRACT_NUMBER (mcnt, p2 + 1);
+ p2 += mcnt + 3;
+ }
- /* When we jump backward we bump `done_end` up to `p3` under
- the assumption that any other position between `done_end`
- and `p3` is either:
- - checked by the other call to RECURSE.
- - not reachable from here (e.g. for positions before the
- `on_failure_jump`), or at least not without first
- jumping before `done_beg`.
- This should hold because our state machines are not arbitrary:
- they consists of syntaxically nested loops with limited
- control flow.
- FIXME: This can fail (i.e. return true when it shouldn't)
- if we start generating bytecode with a different shape,
- so maybe we should bite the bullet and replace done_beg/end
- with an actual list of positions we've already processed. */
-#define RECURSE(p3) \
- ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
- : (p3) <= done_end ? true \
- : mutually_exclusive_aux (bufp, p1, p3, done_beg, \
- (p3) > p2_orig ? done_end : (p3)))
-
- return RECURSE (p2) && RECURSE (p2_other);
+ /* If we jump backward to the entry point of the current loop
+ it means it's a zero-length cycle through that loop, so
+ this cycle itself does not break mutual-exclusion.
+ If we jump backward to a new loop nested within the current one
+ then the `p2_other` points to the exit of that inner loop
+ and the other RECURSE will check what happens when we leave
+ the loop so we can focus on checking just the loop body,
+ so we pass new values for `loop-entry` and `loop_exit`.
+ In all other cases, we just recurse "normally": if it's before
+ `loop_entry` it's a bug that will be caught by checks at
+ the entrace of the function and otherwise it's a forward jump
+ which doesn't need any special treatment.
+ FIXME: This is failsafe (can't return true when it shouldn't)
+ but it could be too conservative if we start generating bytecode
+ with a different shape, so maybe we should bite the bullet and
+ replace done_beg/end with an actual list of positions we've
+ already processed. */
+#define RECURSE(p3, p3_other) \
+ ((p3) == loop_entry ? true \
+ : loop_entry < (p3) && (p3) < p2_orig && (p3_other) > p2_orig ? \
+ mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \
+ : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit))
+
+ return RECURSE (p2, p2_other) && RECURSE (p2_other, p2);
}
default:
@@ -3895,7 +3934,7 @@ #define RECURSE(p3) \
mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
re_char *p2)
{
- return mutually_exclusive_aux (bufp, p1, p2, p2, p2);
+ return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL);
}
/* Matching routines. */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Thu, 21 Sep 2023 17:25:01 GMT)
Full text and
rfc822 format available.
Message #118 received at 65726 <at> debbugs.gnu.org (full text, mbox):
18 sep. 2023 kl. 14.32 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> In the mean time the patch below recognizes the above pattern, which
> gives me code which is safe (it should both always terminate and should
> never apply the optimization if it's not valid) yet does find the
> optimization in all the cases where I expected it.
Yes, maybe this would work but the ad-hocness of it all means that I very much could have missed something (and as you comment, that it's brittle).
> case succeed_n:
> - /* If N == 0, it should be an on_failure_jump_loop instead. */
> - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
> + /* If N == 0, it should be an on_failure_jump_loop instead.
> + `j` can be negative because `EXTRACT_NUMBER` extracts a
> + signed number whereas `succeed_n` treats it as unsigned. */
> + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));
So this is what prevented me from running with counts above 2**15 and debugging enabled for all these years...
> + if (loop_exit && p2 > loop_exit)
> + {
> + void print_compiled_pattern (struct re_pattern_buffer *bufp);
> + print_compiled_pattern (bufp);
> + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
> + p1 - bufp->buffer, p2 - bufp->buffer,
> + loop_entry - bufp->buffer, loop_exit - bufp->buffer);
> + error ("WOW1!");
#ifdef REGEX_EMACS_DEBUG ?
> + }
> + /* return false; */
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Thu, 21 Sep 2023 18:10:01 GMT)
Full text and
rfc822 format available.
Message #121 received at 65726-done <at> debbugs.gnu.org (full text, mbox):
> So this is what prevented me from running with counts above 2**15 and
> debugging enabled for all these years...
Looks like it.
In any case, I pushed a variant of the last patch to `master` and
I think the case is now closed. We may want to revisit it, presumably
by really keeping track of the all visited nodes, but this seems to work
well for now and I'm confident that it is safe (regardless of the shape
of the bytecode, it should always terminate and it should never apply
the optimization if it's incorrect, and with the shape of the bytecode
we currently produce it does correctly apply the optimization in the cases
I checked).
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#65726
; Package
emacs
.
(Sat, 23 Sep 2023 11:58:02 GMT)
Full text and
rfc822 format available.
Message #124 received at 65726-done <at> debbugs.gnu.org (full text, mbox):
21 sep. 2023 kl. 20.08 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> In any case, I pushed a variant of the last patch to `master` and
> I think the case is now closed. We may want to revisit it, presumably
> by really keeping track of the all visited nodes, but this seems to work
> well for now and I'm confident that it is safe (regardless of the shape
> of the bytecode, it should always terminate and it should never apply
> the optimization if it's incorrect, and with the shape of the bytecode
> we currently produce it does correctly apply the optimization in the cases
> I checked).
Good, thank you.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sun, 22 Oct 2023 11:24:08 GMT)
Full text and
rfc822 format available.
This bug report was last modified 1 year and 201 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.