GNU bug report logs - #65726
29.1.50; Crash in regexp engine

Previous Next

Package: emacs;

Reported by: martin rudalics <rudalics <at> gmx.at>

Date: Mon, 4 Sep 2023 07:48:02 UTC

Severity: normal

Found in version 29.1.50

Fixed in version 30.1

Done: Stefan Kangas <stefankangas <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 65726 in the body.
You can then email your comments to 65726 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 07:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to martin rudalics <rudalics <at> gmx.at>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 04 Sep 2023 07:48:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Bug-Gnu-Emacs <bug-gnu-emacs <at> gnu.org>
Subject: 29.1.50; Crash in regexp engine
Date: Mon, 4 Sep 2023 09:46:24 +0200

[Message part 1 (text/plain, inline)]

With emacs -Q load the attached file elinfo.el.  Now type

C-h S split-window RET

C-x o

C-c C-g

This crashes Emacs here with

Thread 1 "emacs" received signal SIGSEGV, Segmentation fault.
0x000000000068810a in skip_noops (p=<error reading variable: Cannot access memory at address 0x7fffff66fff8>, pend=<error reading variable: Cannot access memory at address 0x7fffff66fff0>) at ../../src/regex-emacs.c:3556
3556	{

and an infinite backtrace starting with

Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8:
#0  0x000000000068810a in skip_noops (p=#1  0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665
#2  0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
#3  0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839
#4  0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839
#5  0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
...

The same scenario worked well with Emacs 22 through 28.

martin


In GNU Emacs 29.1.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.5,
 cairo version 1.16.0) of 2023-09-03 built on restno
Repository revision: f1e4cbe72aa4da9351cbbcd209d9233c68dd9fbb
Repository branch: emacs-29
Windowing system distributor 'The X.Org Foundation', version 11.0.12004000
System Description: Debian GNU/Linux 10 (buster)

Configured using:
 'configure --with-gif=ifavailable --with-tiff=ifavailable
 --with-gnutls=no --without-pop --enable-gcc-warnings=warn-only
 --enable-checking=yes,glyphs --enable-check-lisp-object-type=yes
 'CFLAGS=-O0 -g3 -no-pie -Wno-missing-braces''

Configured features:
CAIRO DBUS FREETYPE GIF GLIB GSETTINGS HARFBUZZ JPEG LIBSELINUX MODULES
NOTIFY INOTIFY PDUMPER PNG SECCOMP SOUND THREADS TOOLKIT_SCROLL_BARS X11
XDBE XIM XINPUT2 XPM GTK3 ZLIB

Important settings:
  value of $LANG: de_AT.utf8
  value of $XMODIFIERS: @im=ibus
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  line-number-mode: t
  indent-tabs-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message mailcap yank-media puny dired
dired-loaddefs rfc822 mml mml-sec password-cache epa derived epg rfc6068
epg-config gnus-util text-property-search time-date subr-x mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
cl-loaddefs cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils elinfo texinfo texinfo-loaddefs info rmc
iso-transl tooltip cconv eldoc paren electric uniquify ediff-hook
vc-hooks lisp-float-type elisp-mode mwheel term/x-win x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode lisp-mode prog-mode register
page tab-bar menu-bar rfn-eshadow isearch easymenu timer select
scroll-bar mouse jit-lock font-lock syntax font-core term/tty-colors
frame minibuffer nadvice seq simple cl-generic indonesian philippine
cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese composite emoji-zwj charscript
charprop case-table epa-hook jka-cmpr-hook help abbrev obarray oclosure
cl-preloaded button loaddefs theme-loaddefs faces cus-face macroexp
files window text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget keymap hashtable-print-readable backquote
threads dbusbind inotify dynamic-setting system-font-setting
font-render-setting cairo move-toolbar gtk x-toolkit xinput2 x multi-tty
make-network-process emacs)

Memory information:
((conses 16 45758 8352)
 (symbols 48 5682 0)
 (strings 32 15622 2110)
 (string-bytes 1 431024)
 (vectors 16 10081)
 (vector-slots 8 157669 13073)
 (floats 8 26 23)
 (intervals 56 218 0)
 (buffers 976 10))

[elinfo.el (text/x-emacs-lisp, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 08:45:01 GMT) Full text and rfc822 format available.

Message #8 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: 65726 <at> debbugs.gnu.org
Cc: martin rudalics <rudalics <at> gmx.at>
Subject: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 4 Sep 2023 10:44:06 +0200

Can't reproduce here (macOS). Can you give us a deeper backtrace?
Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 12:14:01 GMT) Full text and rfc822 format available.

Message #11 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: rudalics <at> gmx.at, 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 04 Sep 2023 15:12:53 +0300

> Cc: martin rudalics <rudalics <at> gmx.at>
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Mon, 4 Sep 2023 10:44:06 +0200
> 
> Can't reproduce here (macOS). Can you give us a deeper backtrace?

I can reproduce on MS-Windows.  It's an infinite recursion.

> Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against?

Here's the end of the call stack:

  #104528 0x0120bbc4 in search_buffer_re (string=XIL(0x8000000007e5e308),
      pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1,
      trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4),
      posix=false) at search.c:1265
  #104529 0x0120cae5 in search_buffer (string=XIL(0x8000000007e5e308),
      pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1, RE=1,
      trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4),
      posix=false) at search.c:1527
  #104530 0x0120b404 in search_command (string=XIL(0x8000000007e5e308),
      bound=XIL(0), noerror=XIL(0x30), count=XIL(0), direction=1, RE=1,
      posix=false) at search.c:1069
  #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
      bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294
  #104532 0x0127860e in eval_sub (form=XIL(0xc000000007dbef40)) at eval.c:2508
  #104533 0x0126f601 in Fand (args=XIL(0xc000000007dbe810)) at eval.c:370
  #104534 0x012780e0 in eval_sub (form=XIL(0xc000000007dbef30)) at eval.c:2449
  #104535 0x0126f847 in Fcond (args=XIL(0xc000000007e16d00)) at eval.c:412
  #104536 0x012780e0 in eval_sub (form=XIL(0xc000000007e16d70)) at eval.c:2449
  #104537 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
  #104538 0x01272b16 in Flet (args=XIL(0xc000000007e16d90)) at eval.c:1026
  #104539 0x012780e0 in eval_sub (form=XIL(0xc000000007e16da0)) at eval.c:2449
  #104540 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
  #104541 0x012780e0 in eval_sub (form=XIL(0xc000000007e17740)) at eval.c:2449
  #104542 0x0126f74a in Fif (args=XIL(0xc000000007e16dd0)) at eval.c:391
  #104543 0x012780e0 in eval_sub (form=XIL(0xc000000007e16dc0)) at eval.c:2449
  #104544 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
  #104545 0x01272b16 in Flet (args=XIL(0xc000000007e16e00)) at eval.c:1026
  #104546 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e10)) at eval.c:2449
  #104547 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
  #104548 0x0126f7ae in Fif (args=XIL(0xc000000007e16e30)) at eval.c:392
  #104549 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e20)) at eval.c:2449
  #104550 0x0126fa9d in Fprogn (body=XIL(0xc000000007dbeb50)) at eval.c:436
  #104551 0x0127247c in FletX (args=XIL(0xc000000007e16fc0)) at eval.c:958
  #104552 0x012780e0 in eval_sub (form=XIL(0xc000000007e16fd0)) at eval.c:2449
  #104553 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436
  #104554 0x0127b95e in funcall_lambda (fun=XIL(0xc000000007dbe750), nargs=0,
      arg_vector=0x82ecb0) at eval.c:3233
  #104555 0x0127a4c9 in funcall_general (fun=XIL(0xc000000007dbe750),
      numargs=0, args=0x82ecb0) at eval.c:2957
  #104556 0x0127a6b7 in Ffuncall (nargs=1, args=0x82eca8) at eval.c:2995
  #104557 0x012693e3 in Ffuncall_interactively (nargs=1, args=0x82eca8)
      at callint.c:250
  #104558 0x0127ad17 in funcall_subr (subr=0x1881580 <Sfuncall_interactively>,
      numargs=1, args=0x82eca8) at eval.c:3059
  #104559 0x0127a2cb in funcall_general (fun=XIL(0xa000000001881580),
      numargs=1, args=0x82eca8) at eval.c:2941
  #104560 0x0127a6b7 in Ffuncall (nargs=2, args=0x82eca0) at eval.c:2995
  #104561 0x012790cb in Fapply (nargs=3, args=0x82eca0) at eval.c:2619
  #104562 0x01269ae9 in Fcall_interactively (function=XIL(0x60de7c0),
      record_flag=XIL(0), keys=XIL(0xa000000007970198)) at callint.c:342
  #104563 0x0127a915 in funcall_subr (subr=0x18815c0 <Scall_interactively>,
      numargs=3, args=0x6c50078) at eval.c:3038
  #104564 0x012ed316 in exec_byte_code (fun=XIL(0xa00000000615797c),
      args_template=1025, nargs=1, args=0x82f590) at bytecode.c:809
  #104565 0x0127ae66 in fetch_and_exec_byte_code (fun=XIL(0xa00000000615797c),
      args_template=1025, nargs=1, args=0x82f588) at eval.c:3081
  #104566 0x0127b3c5 in funcall_lambda (fun=XIL(0xa00000000615797c), nargs=1,
      arg_vector=0x82f588) at eval.c:3153
  #104567 0x0127a332 in funcall_general (fun=XIL(0xa00000000615797c),
      numargs=1, args=0x82f588) at eval.c:2945
  #104568 0x0127a6b7 in Ffuncall (nargs=2, args=0x82f580) at eval.c:2995
  #104569 0x0116d992 in call1 (fn=XIL(0x4590), arg1=XIL(0x60de7c0))
      at lisp.h:3248
  #104570 0x01171cb7 in command_loop_1 () at keyboard.c:1503
  #104571 0x01274470 in internal_condition_case (
      bfun=0x11710ec <command_loop_1>, handlers=XIL(0x90),
      hfun=0x11700ba <cmd_error>) at eval.c:1474
  #104572 0x01170b59 in command_loop_2 (handlers=XIL(0x90)) at keyboard.c:1133
  #104573 0x012732f7 in internal_catch (tag=XIL(0x103b0),
      func=0x1170b22 <command_loop_2>, arg=XIL(0x90)) at eval.c:1197
  #104574 0x01170ac4 in command_loop () at keyboard.c:1111
  #104575 0x0116fb1a in recursive_edit_1 () at keyboard.c:720
  #104576 0x0116fdb8 in Frecursive_edit () at keyboard.c:803
  #104577 0x0116ab60 in main (argc=2, argv=0xa428e0) at emacs.c:2521

  Lisp Backtrace:
  "re-search-forward" (0x82d6a0)
  "and" (0x82d8a0)
  "cond" (0x82da60)
  "let" (0x82dcf0)
  "progn" (0x82deb0)
  "if" (0x82e070)
  "let" (0x82e300)
  "if" (0x82e500)
  "let*" (0x82e770)
  "elinfo-goto-texi" (0x82ecb0)
  "funcall-interactively" (0x82eca8)
  "call-interactively" (0x6c50078)
  "command-execute" (0x82f588)
(gdb)

In frame #104531, we have:

  #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
      bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294

and regexp is this monstrocity:

    "\\(split\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)-\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)&optional\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)size\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)side\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)pixelwise\\)\\|^node"

Feel free to ask more questions, as I can reproduce this at will.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 13:19:02 GMT) Full text and rfc822 format available.

Message #14 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: martin rudalics <rudalics <at> gmx.at>, 65726 <at> debbugs.gnu.org,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 4 Sep 2023 15:18:01 +0200

4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz <at> gnu.org>:

> I can reproduce on MS-Windows.  It's an infinite recursion.

Hmmm, strange that I'm unable to then.

The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade?

I shall give this another go but meanwhile maybe Stefan could look at it, too?
Not that I am implything there is anything wrong with his code. Heavens, no.

> In frame #104531, we have:
> 
>  #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308),
>      bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294

That should be the re-search-forward call in elinfo.el:973.

> and regexp is this monstrocity:

Thank you, this one is apparently generated dynamically.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 13:27:01 GMT) Full text and rfc822 format available.

Message #17 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: martin rudalics <rudalics <at> gmx.at>, 65726 <at> debbugs.gnu.org,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 4 Sep 2023 15:26:07 +0200

> Hmmm, strange that I'm unable to then.

Now I can reproduce the crash (I didn't have any TAGS files).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 13:29:02 GMT) Full text and rfc822 format available.

Message #20 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: rudalics <at> gmx.at, 65726 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 04 Sep 2023 16:28:27 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Mon, 4 Sep 2023 15:18:01 +0200
> Cc: 65726 <at> debbugs.gnu.org,
>  martin rudalics <rudalics <at> gmx.at>,
>  Stefan Monnier <monnier <at> iro.umontreal.ca>
> 
> 4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > I can reproduce on MS-Windows.  It's an infinite recursion.
> 
> Hmmm, strange that I'm unable to then.
> 
> The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade?

Yes, if I revert that change, the problem goes away.

> I shall give this another go but meanwhile maybe Stefan could look at it, too?
> Not that I am implything there is anything wrong with his code. Heavens, no.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 14:33:02 GMT) Full text and rfc822 format available.

Message #23 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: martin rudalics <rudalics <at> gmx.at>
Cc: 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 04 Sep 2023 10:32:38 -0400

> Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8:
> #0  0x000000000068810a in skip_noops (p=#1  0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665
> #2  0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
> #3  0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839
> #4  0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839
> #5  0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838
> ...

Hmm... the line numbers strongly suggests the inf-recursion happens via
the calls:

    case on_failure_jump:
      {
        int mcnt;
	p2++;
	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
	/* Don't just test `mcnt > 0` because non-greedy loops have
	   their test at the end with an unconditional jump at the start.  */
	if (p2 + mcnt > p2_orig) /* Ensure forward progress.  */
	  return (mutually_exclusive_p (bufp, p1, p2)
		  && mutually_exclusive_p (bufp, p1, p2 + mcnt));
	break;
      }

Re-reading the code I see that `skip_noops` can return a position
smaller than its argument, which makes it possible for `p2` to
be smaller (or equal) to `p2_orig` and hence explain that inf-loop
(that's the only path I can see that explains the inf-loop you're
seeing).

So, the patch below should hopefully fix your problem.


        Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 7e75f0ac597..3a14c10771d 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3832,7 +3832,8 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
 	/* Don't just test `mcnt > 0` because non-greedy loops have
 	   their test at the end with an unconditional jump at the start.  */
-	if (p2 + mcnt > p2_orig) /* Ensure forward progress.  */
+	if (p2 + mcnt > p2_orig /* Ensure forward progress.  */
+	    && p2 > p2_orig)    /* Bug#65726  */
 	  return (mutually_exclusive_p (bufp, p1, p2)
 		  && mutually_exclusive_p (bufp, p1, p2 + mcnt));
 	break;

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 15:48:01 GMT) Full text and rfc822 format available.

Message #26 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rudalics <at> gmx.at, 65726 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 4 Sep 2023 17:47:04 +0200

Minimised reproduction recipe:

  (string-match (rx (* "a") (* (* "b"))) "a")

(Don't worry about the nested stars; the original was more like (* (| A (* B))).)

The generated bytecode is:

    0  on-failure-jump-smart to 9
    3  exact "a"
    6  jump to 0
    9  on-failure-jump-loop to 24
   12  on-failure-jump-smart to 21
   15  exact "b"
   18  jump to 12
   21  jump to 9
   24  succeed

which suggests a mutual_exclusive_p recursion loop:

  mep(3,9)
  -> mep(3,12)
     -> mep(3,21) = {skip_noop} = mep(3,9)
        -> mep(3,12)
           ...

Maybe there is something clever we could do here but it's possible that we just can't avoid keeping track of visited nodes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 15:59:01 GMT) Full text and rfc822 format available.

Message #29 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: rudalics <at> gmx.at, 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 04 Sep 2023 18:57:41 +0300

> Cc: 65726 <at> debbugs.gnu.org
> Date: Mon, 04 Sep 2023 10:32:38 -0400
> From:  Stefan Monnier via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> Re-reading the code I see that `skip_noops` can return a position
> smaller than its argument, which makes it possible for `p2` to
> be smaller (or equal) to `p2_orig` and hence explain that inf-loop
> (that's the only path I can see that explains the inf-loop you're
> seeing).
> 
> So, the patch below should hopefully fix your problem.

Yes, with this patch the problem is gone.

Please install on the release branch, when you are satisfied with the
solution, and thanks.

Reply sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
You have taken responsibility. (Mon, 04 Sep 2023 17:13:02 GMT) Full text and rfc822 format available.

Notification sent to martin rudalics <rudalics <at> gmx.at>:
bug acknowledged by developer. (Mon, 04 Sep 2023 17:13:02 GMT) Full text and rfc822 format available.

Message #34 received at 65726-done <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rudalics <at> gmx.at, 65726-done <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 04 Sep 2023 13:12:20 -0400

> Yes, with this patch the problem is gone.

Thanks, pushed to `emacs-29`.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 04 Sep 2023 17:13:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Tue, 05 Sep 2023 07:15:02 GMT) Full text and rfc822 format available.

Message #40 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Tue, 5 Sep 2023 09:14:23 +0200

> So, the patch below should hopefully fix your problem.

It fixes the problem.

Thanks, martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Tue, 05 Sep 2023 12:24:01 GMT) Full text and rfc822 format available.

Message #43 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Tue, 5 Sep 2023 14:23:19 +0200

Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack overflow. It seems that by fixing this bug, we've unfixed the one that the broken code was supposed to fix.

And we should probably include a regression test for this bug as well. The previously mentioned

  (string-match (rx (* "a") (* (* "b"))) "a")

might suffice, but extending it to

  (string-match (rx (* "a") (* (or "c" (* "b")))) "a")

is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X).

Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 05 Sep 2023 12:35:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Tue, 05 Sep 2023 13:09:01 GMT) Full text and rfc822 format available.

Message #48 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Tue, 05 Sep 2023 09:08:01 -0400

> Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack
> overflow.

Really?  I thought I had specifically run that test.
[ ...recompiling + rerunning the test... ]
Hmm... you're right.  Crap.

> It seems that by fixing this bug, we've unfixed the one that the
> broken code was supposed to fix.

Well, that one was less of a bug (more of a missing optimization), so
it's no reason to revert the change, but yes, not good.
Let's see if I can finagle something.

> And we should probably include a regression test for this bug as well.
> The previously mentioned
>
>   (string-match (rx (* "a") (* (* "b"))) "a")
>
> might suffice, but extending it to
>
>   (string-match (rx (* "a") (* (or "c" (* "b")))) "a")
>
> is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X).

Feel free.  To me, it feels too much like testing the presence/absence
of a very specific bug (i.e. too unlikely that we'll reintroduce that
exact bug.  It's not like it was a weird case I didn't think of).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Tue, 05 Sep 2023 13:51:01 GMT) Full text and rfc822 format available.

Message #51 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Tue, 5 Sep 2023 15:50:05 +0200

5 sep. 2023 kl. 15.08 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> Let's see if I can finagle something.

I liked your previous optimisation, so if you could finagle it back without breaking anything that would be pukka.

> Feel free.  To me, it feels too much like testing the presence/absence
> of a very specific bug (i.e. too unlikely that we'll reintroduce that
> exact bug.  It's not like it was a weird case I didn't think of).

Alright, pushed to emacs-29. It's a cheap test and offers at least some kind of protection in the event that someone undoes your handiwork. Such things do happen after all.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Tue, 05 Sep 2023 15:34:01 GMT) Full text and rfc822 format available.

Message #54 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Tue, 05 Sep 2023 11:33:05 -0400

>> Feel free.  To me, it feels too much like testing the presence/absence
>> of a very specific bug (i.e. too unlikely that we'll reintroduce that
>> exact bug.  It's not like it was a weird case I didn't think of).
> Alright, pushed to emacs-29. It's a cheap test and offers at least some kind
> of protection in the event that someone undoes your handiwork. Such things
> do happen after all.

I marked the backtracking test as failing because the "real" fix wil
likely be too risky to put onto `emacs-29` (at least for now).

>> Let's see if I can finagle something.
> I liked your previous optimisation, so if you could finagle it back without
> breaking anything that would be pukka.

WDYT of the approach below?
You can ignore the `mutually_exclusive_(charset|exactn)` thingies which
just move code around (it's necessary for the patch).

It's a bit ugly because the actual assumptions on which it relies aren't
spelled out explicitly (and even less assert-checked at run time).

The basic idea is that `done` points to the beginning of the "current
loop": as long as we move forward we presume we may still be in the
current loop, and when we find a jump backward it's either a jump to the
beginning of a new (nested/further) loop (so we move `done`) or a jump
to the same loop or to some earlier loop (which we presumably handle
elsewhere).

I also tried a safer option where we do:

	return ((p2 < done ? false
		 : p2 == done ? true
		 : mutually_exclusive_aux (bufp, p1, p2,
		                           p2 > p2_orig ? done : p2))
		&& (p2_other < done ? false
		    : p2_other == done ? true
		    : mutually_exclusive_aux (bufp, p1, p2_other,
		                              p2_other > p2_orig
		                              ? done : p2_other)));

i.e. we only consider `done` itself as "already checked", which should
definitely be safe.  It also seems to be sufficient to handle things
like:

    (should (looking-at "x*\\(=+?\\|h\\)+"))

Adding an `assert (p2 >= done)` confirms that this does happen, so
whether we return true or false when `p2 < done` does matter, so I guess
we should go with the safer option unless we can really convince
ourselves that the more optimistic option is also correct.

Then again, maybe we should bite the bullet and actually keep track of
the positions already visited, so we don't need any "fancy" argument.


        Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 7e75f0ac597..0d3cf39d619 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
   return not;
 }
 
+/* Case where `p2` points to an `exactn`.  */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+		           re_char *p2)
+{
+  bool multibyte = RE_MULTIBYTE_P (bufp);
+  int c
+    = (re_opcode_t) *p2 == endline ? '\n'
+      : RE_STRING_CHAR (p2 + 2, multibyte);
+
+  if ((re_opcode_t) *p1 == exactn)
+    {
+      if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+	{
+	  DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
+	  return true;
+	}
+    }
+
+  else if ((re_opcode_t) *p1 == charset
+	   || (re_opcode_t) *p1 == charset_not)
+    {
+      if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+                            Qnil))
+	{
+	  DEBUG_PRINT ("	 No match => fast loop.\n");
+	  return true;
+	}
+    }
+  else if ((re_opcode_t) *p1 == anychar
+	   && c == '\n')
+    {
+      DEBUG_PRINT ("   . != \\n => fast loop.\n");
+      return true;
+    }
+  return false;
+}
+
+/* Case where `p2` points to an `charset`.  */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+		            re_char *p2)
+{
+  /* It is hard to list up all the character in charset
+     P2 if it includes multibyte character.  Give up in
+     such case.  */
+  if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+    {
+      /* Now, we are sure that P2 has no range table.
+	     So, for the size of bitmap in P2, 'p2[1]' is
+	     enough.  But P1 may have range table, so the
+	     size of bitmap table of P1 is extracted by
+	     using macro 'CHARSET_BITMAP_SIZE'.
+
+	     In a multibyte case, we know that all the character
+	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
+	     bitmap table.  So, in both cases, it is enough to test
+	     only the bitmap table of P1.  */
+
+      if ((re_opcode_t) *p1 == charset)
+	{
+	  int idx;
+	  /* We win if the charset inside the loop
+		 has no overlap with the one after the loop.  */
+	  for (idx = 0;
+		(idx < (int) p2[1]
+		 && idx < CHARSET_BITMAP_SIZE (p1));
+		idx++)
+	    if ((p2[2 + idx] & p1[2 + idx]) != 0)
+	      break;
+
+	  if (idx == p2[1]
+	      || idx == CHARSET_BITMAP_SIZE (p1))
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+      else if ((re_opcode_t) *p1 == charset_not)
+	{
+	  int idx;
+	  /* We win if the charset_not inside the loop lists
+		 every character listed in the charset after.  */
+	  for (idx = 0; idx < (int) p2[1]; idx++)
+	    if (! (p2[2 + idx] == 0
+		   || (idx < CHARSET_BITMAP_SIZE (p1)
+		       && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+	      break;
+
+	  if (idx == p2[1])
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+    }
+  return false;
+}
+
 /* True if "p1 matches something" implies "p2 fails".  */
+/* Avoiding inf-loops:
+   We're trying to follow all paths reachable from `p2`, but since some
+   loops can match the empty string, this can loop back to `p2`.
+   To avoid inf-looping, we rely on a lexicographic ordering using `done`
+   and `p2`: at every recursion, either `done` is larger, or it
+   is unchanged and `p2` is larger.  */
 static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
-		      re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+		        re_char *p2, re_char *done)
 {
   re_opcode_t op2;
-  bool multibyte = RE_MULTIBYTE_P (bufp);
   unsigned char *pend = bufp->buffer + bufp->used;
   re_char *p2_orig = p2;
 
   eassert (p1 >= bufp->buffer && p1 < pend
 	   && p2 >= bufp->buffer && p2 <= pend);
 
+  eassert (p2 >= done);
+
   /* Skip over open/close-group commands.
      If what follows this loop is a ...+ construct,
      look at what begins its body, since we will have to
@@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 
     case endline:
     case exactn:
-      {
-	int c
-	  = (re_opcode_t) *p2 == endline ? '\n'
-	  : RE_STRING_CHAR (p2 + 2, multibyte);
-
-	if ((re_opcode_t) *p1 == exactn)
-	  {
-	    if (c != RE_STRING_CHAR (p1 + 2, multibyte))
-	      {
-		DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
-		return true;
-	      }
-	  }
-
-	else if ((re_opcode_t) *p1 == charset
-		 || (re_opcode_t) *p1 == charset_not)
-	  {
-	    if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
-                                  Qnil))
-	      {
-		DEBUG_PRINT ("	 No match => fast loop.\n");
-		return true;
-	      }
-	  }
-	else if ((re_opcode_t) *p1 == anychar
-		 && c == '\n')
-	  {
-	    DEBUG_PRINT ("   . != \\n => fast loop.\n");
-	    return true;
-	  }
-      }
-      break;
+      return mutually_exclusive_exactn (bufp, p1, p2);
 
     case charset:
       {
 	if ((re_opcode_t) *p1 == exactn)
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
-
-      /* It is hard to list up all the character in charset
-	 P2 if it includes multibyte character.  Give up in
-	 such case.  */
-      else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
-	{
-	  /* Now, we are sure that P2 has no range table.
-	     So, for the size of bitmap in P2, 'p2[1]' is
-	     enough.  But P1 may have range table, so the
-	     size of bitmap table of P1 is extracted by
-	     using macro 'CHARSET_BITMAP_SIZE'.
-
-	     In a multibyte case, we know that all the character
-	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
-	     bitmap table.  So, in both cases, it is enough to test
-	     only the bitmap table of P1.  */
-
-	  if ((re_opcode_t) *p1 == charset)
-	    {
-	      int idx;
-	      /* We win if the charset inside the loop
-		 has no overlap with the one after the loop.  */
-	      for (idx = 0;
-		   (idx < (int) p2[1]
-		    && idx < CHARSET_BITMAP_SIZE (p1));
-		   idx++)
-		if ((p2[2 + idx] & p1[2 + idx]) != 0)
-		  break;
-
-	      if (idx == p2[1]
-		  || idx == CHARSET_BITMAP_SIZE (p1))
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	    }
-	  else if ((re_opcode_t) *p1 == charset_not)
-	    {
-	      int idx;
-	      /* We win if the charset_not inside the loop lists
-		 every character listed in the charset after.  */
-	      for (idx = 0; idx < (int) p2[1]; idx++)
-		if (! (p2[2 + idx] == 0
-		       || (idx < CHARSET_BITMAP_SIZE (p1)
-			   && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
-		  break;
-
-	      if (idx == p2[1])
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	      }
-	  }
+	  return mutually_exclusive_exactn (bufp, p2, p1);
+	else
+	  return mutually_exclusive_charset (bufp, p1, p2);
       }
       break;
 
@@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
       switch (*p1)
 	{
 	case exactn:
+	  return mutually_exclusive_exactn (bufp, p2, p1);
 	case charset:
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
+	  return mutually_exclusive_charset (bufp, p2, p1);
 	case charset_not:
 	  /* When we have two charset_not, it's very unlikely that
 	     they don't overlap.  The union of the two sets of excluded
@@ -3830,11 +3852,18 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
         int mcnt;
 	p2++;
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
-	/* Don't just test `mcnt > 0` because non-greedy loops have
-	   their test at the end with an unconditional jump at the start.  */
-	if (p2 + mcnt > p2_orig) /* Ensure forward progress.  */
-	  return (mutually_exclusive_p (bufp, p1, p2)
-		  && mutually_exclusive_p (bufp, p1, p2 + mcnt));
+	re_char *p2_other = p2 + mcnt;
+	/* Presumably, any position up to `done` should be a position we
+	   already considered elsewhere (because our state machines are not
+	   arbitrary: they consists of syntaxically nested loops with limited
+	   control flow), so we can consider those back jumps as mutually
+	   exclusive here.  */
+	return ((p2 <= done
+		 || mutually_exclusive_aux (bufp, p1, p2,
+		                            p2 > p2_orig ? done : p2))
+		&& (p2_other <= done
+		    || mutually_exclusive_aux (bufp, p1, p2_other,
+		                               p2_other > p2_orig ? done : p2_other)));
 	break;
       }
 
@@ -3846,6 +3875,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
   return false;
 }
 
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+		      re_char *p2)
+{
+  return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1));
+}
 
 /* Matching routines.  */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Wed, 06 Sep 2023 12:04:02 GMT) Full text and rfc822 format available.

Message #57 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Wed, 6 Sep 2023 14:03:01 +0200

5 sep. 2023 kl. 17.33 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> You can ignore the `mutually_exclusive_(charset|exactn)` thingies which
> just move code around (it's necessary for the patch).

Actually that's a good refactoring to start with regardless of the path we follow later. Makes things more readable if nothing else.

> The basic idea is that `done` points to the beginning of the "current
> loop": as long as we move forward we presume we may still be in the
> current loop, and when we find a jump backward it's either a jump to the
> beginning of a new (nested/further) loop (so we move `done`) or a jump
> to the same loop or to some earlier loop (which we presumably handle
> elsewhere).

So how can we describe the exact semantics of `done` in mutually_exclusive_aux(p1,p2,done)?
Something like:

Let me(p1,p2) = true iff (p1 matches) implies (p2 fails).
Then me_aux(p1,p2,done) = me(p1,p2)
under the assumption that me(p1,x) holds for all x<done where x is reachable from p2.

?

> Adding an `assert (p2 >= done)` confirms that this does happen,

Can you give an example of such a regexp? I ran `make check` with your patch applied and checking enabled but got no assertion failure.

> so
> whether we return true or false when `p2 < done` does matter, so I guess
> we should go with the safer option unless we can really convince
> ourselves that the more optimistic option is also correct.
> 
> Then again, maybe we should bite the bullet and actually keep track of
> the positions already visited, so we don't need any "fancy" argument.

There are a couple of ways of doing that but it might not be that bad. It would also be more robust in face of future changes that may not obey the current invariants.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 09 Sep 2023 15:56:02 GMT) Full text and rfc822 format available.

Message #60 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sat, 09 Sep 2023 11:55:28 -0400

>> Adding an `assert (p2 >= done)` confirms that this does happen,
>
> Can you give an example of such a regexp? I ran `make check` with your patch
> applied and checking enabled but got no assertion failure.

My patch didn't include the assertion :-)
The patch below does and hits an assertion failure during the dump:

    Loading format...
    Loading bindings...
    WOW1!
    
    Error: error ("WOW1!")
      mapbacktrace(#[1028 "\1\4\203\24\0\301\302!\210\300\4!\210\301\303!\210\202\35\0\301\304!\210\3\3B\262\1\211\2035\0\300\1@!\210\211A\211\262\2\2035\0\301\305!\210\202!\0\301\306!\207" [prin1 princ "  " "(" "  (" " " ")\n"] 7 "\n\n(fn EVALD FUNC ARGS FLAGS)"])
      debug-early-backtrace()
      debug-early(error (error "WOW1!"))
      string-match("\\`[^ ]+\\( [^ ]+\\)*\\'" "h r" nil t)
      key-valid-p("h r")
      keymap--check("h r")
      keymap-set((keymap (27 keymap (119 . eww-search-words)) (111 . occur)) "h r" highlight-regexp)
      define-keymap("o" occur "M-w" eww-search-words "h r" highlight-regexp "h p" highlight-phrase "h l" highlight-lines-matching-regexp "h ." highlight-symbol-at-point "h u" unhighlight-regexp "h f" hi-lock-find-patterns "h w" hi-lock-write-interactive-patterns)
      (defvar search-map (define-keymap "o" 'occur "M-w" 'eww-search-words "h r" 'highlight-regexp "h p" 'highlight-phrase "h l" 'highlight-lines-matching-regexp "h ." 'highlight-symbol-at-point "h u" 'unhighlight-regexp "h f" 'hi-lock-find-patterns "h w" 'hi-lock-write-interactive-patterns) ("/home/monnier/src/emacs/main/lisp/bindings.elc" . 40677))
      load("bindings")
      load("loadup.el")


-- Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 394ba22e9b0..b47cdfbfef8 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
   return not;
 }
 
+/* Case where `p2` points to an `exactn`.  */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+		           re_char *p2)
+{
+  bool multibyte = RE_MULTIBYTE_P (bufp);
+  int c
+    = (re_opcode_t) *p2 == endline ? '\n'
+      : RE_STRING_CHAR (p2 + 2, multibyte);
+
+  if ((re_opcode_t) *p1 == exactn)
+    {
+      if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+	{
+	  DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
+	  return true;
+	}
+    }
+
+  else if ((re_opcode_t) *p1 == charset
+	   || (re_opcode_t) *p1 == charset_not)
+    {
+      if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+                            Qnil))
+	{
+	  DEBUG_PRINT ("	 No match => fast loop.\n");
+	  return true;
+	}
+    }
+  else if ((re_opcode_t) *p1 == anychar
+	   && c == '\n')
+    {
+      DEBUG_PRINT ("   . != \\n => fast loop.\n");
+      return true;
+    }
+  return false;
+}
+
+/* Case where `p2` points to an `charset`.  */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+		            re_char *p2)
+{
+  /* It is hard to list up all the character in charset
+     P2 if it includes multibyte character.  Give up in
+     such case.  */
+  if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+    {
+      /* Now, we are sure that P2 has no range table.
+	     So, for the size of bitmap in P2, 'p2[1]' is
+	     enough.  But P1 may have range table, so the
+	     size of bitmap table of P1 is extracted by
+	     using macro 'CHARSET_BITMAP_SIZE'.
+
+	     In a multibyte case, we know that all the character
+	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
+	     bitmap table.  So, in both cases, it is enough to test
+	     only the bitmap table of P1.  */
+
+      if ((re_opcode_t) *p1 == charset)
+	{
+	  int idx;
+	  /* We win if the charset inside the loop
+		 has no overlap with the one after the loop.  */
+	  for (idx = 0;
+		(idx < (int) p2[1]
+		 && idx < CHARSET_BITMAP_SIZE (p1));
+		idx++)
+	    if ((p2[2 + idx] & p1[2 + idx]) != 0)
+	      break;
+
+	  if (idx == p2[1]
+	      || idx == CHARSET_BITMAP_SIZE (p1))
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+      else if ((re_opcode_t) *p1 == charset_not)
+	{
+	  int idx;
+	  /* We win if the charset_not inside the loop lists
+		 every character listed in the charset after.  */
+	  for (idx = 0; idx < (int) p2[1]; idx++)
+	    if (! (p2[2 + idx] == 0
+		   || (idx < CHARSET_BITMAP_SIZE (p1)
+		       && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+	      break;
+
+	  if (idx == p2[1])
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+    }
+  return false;
+}
+
 /* True if "p1 matches something" implies "p2 fails".  */
+/* Avoiding inf-loops:
+   We're trying to follow all paths reachable from `p2`, but since some
+   loops can match the empty string, this can loop back to `p2`.
+   To avoid inf-looping, we rely on a lexicographic ordering using `done`
+   and `p2`: at every recursion, either `done` is larger, or it
+   is unchanged and `p2` is larger.  */
 static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
-		      re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+		        re_char *p2, re_char *done)
 {
   re_opcode_t op2;
-  bool multibyte = RE_MULTIBYTE_P (bufp);
   unsigned char *pend = bufp->buffer + bufp->used;
   re_char *p2_orig = p2;
 
   eassert (p1 >= bufp->buffer && p1 < pend
 	   && p2 >= bufp->buffer && p2 <= pend);
 
+  eassert (p2 >= done);
+
   /* Skip over open/close-group commands.
      If what follows this loop is a ...+ construct,
      look at what begins its body, since we will have to
@@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 
     case endline:
     case exactn:
-      {
-	int c
-	  = (re_opcode_t) *p2 == endline ? '\n'
-	  : RE_STRING_CHAR (p2 + 2, multibyte);
-
-	if ((re_opcode_t) *p1 == exactn)
-	  {
-	    if (c != RE_STRING_CHAR (p1 + 2, multibyte))
-	      {
-		DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
-		return true;
-	      }
-	  }
-
-	else if ((re_opcode_t) *p1 == charset
-		 || (re_opcode_t) *p1 == charset_not)
-	  {
-	    if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
-                                  Qnil))
-	      {
-		DEBUG_PRINT ("	 No match => fast loop.\n");
-		return true;
-	      }
-	  }
-	else if ((re_opcode_t) *p1 == anychar
-		 && c == '\n')
-	  {
-	    DEBUG_PRINT ("   . != \\n => fast loop.\n");
-	    return true;
-	  }
-      }
-      break;
+      return mutually_exclusive_exactn (bufp, p1, p2);
 
     case charset:
       {
 	if ((re_opcode_t) *p1 == exactn)
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
-
-      /* It is hard to list up all the character in charset
-	 P2 if it includes multibyte character.  Give up in
-	 such case.  */
-      else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
-	{
-	  /* Now, we are sure that P2 has no range table.
-	     So, for the size of bitmap in P2, 'p2[1]' is
-	     enough.  But P1 may have range table, so the
-	     size of bitmap table of P1 is extracted by
-	     using macro 'CHARSET_BITMAP_SIZE'.
-
-	     In a multibyte case, we know that all the character
-	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
-	     bitmap table.  So, in both cases, it is enough to test
-	     only the bitmap table of P1.  */
-
-	  if ((re_opcode_t) *p1 == charset)
-	    {
-	      int idx;
-	      /* We win if the charset inside the loop
-		 has no overlap with the one after the loop.  */
-	      for (idx = 0;
-		   (idx < (int) p2[1]
-		    && idx < CHARSET_BITMAP_SIZE (p1));
-		   idx++)
-		if ((p2[2 + idx] & p1[2 + idx]) != 0)
-		  break;
-
-	      if (idx == p2[1]
-		  || idx == CHARSET_BITMAP_SIZE (p1))
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	    }
-	  else if ((re_opcode_t) *p1 == charset_not)
-	    {
-	      int idx;
-	      /* We win if the charset_not inside the loop lists
-		 every character listed in the charset after.  */
-	      for (idx = 0; idx < (int) p2[1]; idx++)
-		if (! (p2[2 + idx] == 0
-		       || (idx < CHARSET_BITMAP_SIZE (p1)
-			   && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
-		  break;
-
-	      if (idx == p2[1])
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	      }
-	  }
+	  return mutually_exclusive_exactn (bufp, p2, p1);
+	else
+	  return mutually_exclusive_charset (bufp, p1, p2);
       }
       break;
 
@@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
       switch (*p1)
 	{
 	case exactn:
+	  return mutually_exclusive_exactn (bufp, p2, p1);
 	case charset:
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
+	  return mutually_exclusive_charset (bufp, p2, p1);
 	case charset_not:
 	  /* When we have two charset_not, it's very unlikely that
 	     they don't overlap.  The union of the two sets of excluded
@@ -3830,11 +3852,25 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
         int mcnt;
 	p2++;
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
-	/* Don't just test `mcnt > 0` because non-greedy loops have
-	   their test at the end with an unconditional jump at the start.  */
-	if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress.  */
-	  return (mutually_exclusive_p (bufp, p1, p2)
-		  && mutually_exclusive_p (bufp, p1, p2 + mcnt));
+	re_char *p2_other = p2 + mcnt;
+	/* Presumably, any position up to `done` should be a position we
+	   already considered elsewhere (because our state machines are not
+	   arbitrary: they consists of syntaxically nested loops with limited
+	   control flow), so we can consider those back jumps as mutually
+	   exclusive here.  */
+	if (p2 < done)
+	  error ("WOW1!");
+	if (p2_other < done)
+	  error ("WOW2!");
+	return ((p2 < done ? true
+		 : p2 == done ? true
+		 : mutually_exclusive_aux (bufp, p1, p2,
+		                           p2 > p2_orig ? done : p2))
+		&& (p2_other < done ? true
+		    : p2_other == done ? true
+		    : mutually_exclusive_aux (bufp, p1, p2_other,
+		                              p2_other > p2_orig
+		                              ? done : p2_other)));
 	break;
       }
 
@@ -3846,6 +3882,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
   return false;
 }
 
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+		      re_char *p2)
+{
+  return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1));
+}
 
 /* Matching routines.  */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 09 Sep 2023 16:35:03 GMT) Full text and rfc822 format available.

Message #63 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sat, 9 Sep 2023 18:34:42 +0200

9 sep. 2023 kl. 17.55 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> My patch didn't include the assertion :-)

Oh, I thought you meant the eassert (p2 >= done); added at the start of mutually_exclusive_aux. Sorry about the confusion.
Anyway, disassembling the regexp makes it clear why:

Disassembly of regexp "\\`[^ ]+\\( [^ ]+\\)*\\'"

    0  begbuf
    1  charset-not [ ]
    8  on-failure-jump-smart to 21
   11  charset-not [ ]
   18  jump to 8
   21  on-failure-jump to 54
   24  start-memory group 1
   26  exact " "
   29  charset-not [ ]
   36  on-failure-jump-smart to 49
   39  charset-not [ ]
   46  jump to 36
   49  stop-memory group 1
   51  jump to 21
   54  endbuf
   55  succeed

which unless I'm mistaken we can condense to:

Disassembly of regexp "\\(?:ba*\\)*"

    0  on-failure-jump to 18
    3  exact "b"
    6  on-failure-jump-smart to 15
    9  exact "a"
   12  jump to 6
   15  jump to 0
   18  succeed

so yes, we may need to remember where we've been. (At this point someone will inevitably point out a helpful invariant that is obvious in hindsight. This is just my cunning attempt at making that happen.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sun, 10 Sep 2023 07:51:02 GMT) Full text and rfc822 format available.

Message #66 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: rudalics <at> gmx.at, Eli Zaretskii <eliz <at> gnu.org>, 65726 <at> debbugs.gnu.org,
 65726-done <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sun, 10 Sep 2023 00:50:48 -0700

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

>> Yes, with this patch the problem is gone.
>
> Thanks, pushed to `emacs-29`.

Is there anything more to do here, or can this bug report be closed?

Reply sent to Stefan Kangas <stefankangas <at> gmail.com>:
You have taken responsibility. (Sun, 10 Sep 2023 07:51:02 GMT) Full text and rfc822 format available.

Notification sent to martin rudalics <rudalics <at> gmx.at>:
bug acknowledged by developer. (Sun, 10 Sep 2023 07:51:02 GMT) Full text and rfc822 format available.

bug Marked as fixed in versions 30.1. Request was from Stefan Kangas <stefankangas <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 10 Sep 2023 07:52:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sun, 10 Sep 2023 07:56:02 GMT) Full text and rfc822 format available.

Message #76 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: rudalics <at> gmx.at, monnier <at> iro.umontreal.ca, 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sun, 10 Sep 2023 10:55:26 +0300

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Sun, 10 Sep 2023 00:50:48 -0700
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 65726 <at> debbugs.gnu.org, rudalics <at> gmx.at, 
> 	65726-done <at> debbugs.gnu.org
> 
> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
> 
> >> Yes, with this patch the problem is gone.
> >
> > Thanks, pushed to `emacs-29`.
> 
> Is there anything more to do here, or can this bug report be closed?

I believe this is still being worked on and discussed by Stefan and
Mattias.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sun, 10 Sep 2023 23:10:02 GMT) Full text and rfc822 format available.

Message #79 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rudalics <at> gmx.at, Stefan Kangas <stefankangas <at> gmail.com>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sun, 10 Sep 2023 19:09:00 -0400

>> >> Yes, with this patch the problem is gone.
>> > Thanks, pushed to `emacs-29`.
>> Is there anything more to do here, or can this bug report be closed?
> I believe this is still being worked on and discussed by Stefan and
> Mattias.

I think we can close it.  What is being discussed is how to recover the
optimization that had to be significantly weakened to fix this patch,
but it was a new optimization in Emacs-29, so it's not urgent nor
a bug fix.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 11 Sep 2023 08:12:02 GMT) Full text and rfc822 format available.

Message #82 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 Stefan Kangas <stefankangas <at> gmail.com>, 65726 <at> debbugs.gnu.org
Subject: bug#65726: 29.1.50; Crash in regexp engine 
Date: Mon, 11 Sep 2023 10:10:50 +0200

> I think we can close it. What is being discussed is how to recover the optimization that had to be significantly weakened to fix this patch, but it was a new optimization in Emacs-29, so it's not urgent nor a bug fix.

Would you confirm that bug#61514 remains 'fixed' for practical purposes? I ran a few of the cases in that bug and it seems that we have them still covered, but I may have missed something.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 11 Sep 2023 13:42:02 GMT) Full text and rfc822 format available.

Message #85 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 Stefan Kangas <stefankangas <at> gmail.com>, 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 11 Sep 2023 09:41:36 -0400

> Would you confirm that bug#61514 remains 'fixed' for practical purposes?

I believe it does: the real fix was to use narrowing.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 11 Sep 2023 14:47:02 GMT) Full text and rfc822 format available.

Message #88 received at 65726-done <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>, Eli Zaretskii <eliz <at> gnu.org>
Cc: rudalics <at> gmx.at, 65726-done <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 11 Sep 2023 07:46:46 -0700

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> I think we can close it.  What is being discussed is how to recover the
> optimization that had to be significantly weakened to fix this patch,
> but it was a new optimization in Emacs-29, so it's not urgent nor
> a bug fix.

Thanks, I'm therefore closing this bug report.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Thu, 14 Sep 2023 14:42:01 GMT) Full text and rfc822 format available.

Message #91 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Thu, 14 Sep 2023 10:41:30 -0400

[Message part 1 (text/plain, inline)]

> so yes, we may need to remember where we've been. (At this point
> someone will inevitably point out a helpful invariant that is obvious
> in hindsight. This is just my cunning attempt at making that happen.)

No helpful invariant that makes it trivial, but if we keep pushing the
same idea that we rely on the assumption that we just have
a syntactically nested loop nest, then we can handle that as in the
patch below.

I.e. keep the idea I proposed of keeping track of a beg..end region
that's already been handled.  But now we really do have to maintain both
ends (before, I only had `done` which kept track of the end of the
region), and just "restart" when we jump back to a point before the
"done" region.


        Stefan

[regexp.c.patch (text/x-diff, inline)]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 394ba22e9b0..a6b7faadf86 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3643,19 +3643,128 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte,
   return not;
 }
 
+/* Case where `p2` points to an `exactn`.  */
+static bool
+mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1,
+		           re_char *p2)
+{
+  bool multibyte = RE_MULTIBYTE_P (bufp);
+  int c
+    = (re_opcode_t) *p2 == endline ? '\n'
+      : RE_STRING_CHAR (p2 + 2, multibyte);
+
+  if ((re_opcode_t) *p1 == exactn)
+    {
+      if (c != RE_STRING_CHAR (p1 + 2, multibyte))
+	{
+	  DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
+	  return true;
+	}
+    }
+
+  else if ((re_opcode_t) *p1 == charset
+	   || (re_opcode_t) *p1 == charset_not)
+    {
+      if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+                            Qnil))
+	{
+	  DEBUG_PRINT ("	 No match => fast loop.\n");
+	  return true;
+	}
+    }
+  else if ((re_opcode_t) *p1 == anychar
+	   && c == '\n')
+    {
+      DEBUG_PRINT ("   . != \\n => fast loop.\n");
+      return true;
+    }
+  return false;
+}
+
+/* Case where `p2` points to an `charset`.  */
+static bool
+mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
+		            re_char *p2)
+{
+  /* It is hard to list up all the character in charset
+     P2 if it includes multibyte character.  Give up in
+     such case.  */
+  if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
+    {
+      /* Now, we are sure that P2 has no range table.
+	     So, for the size of bitmap in P2, 'p2[1]' is
+	     enough.  But P1 may have range table, so the
+	     size of bitmap table of P1 is extracted by
+	     using macro 'CHARSET_BITMAP_SIZE'.
+
+	     In a multibyte case, we know that all the character
+	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
+	     bitmap table.  So, in both cases, it is enough to test
+	     only the bitmap table of P1.  */
+
+      if ((re_opcode_t) *p1 == charset)
+	{
+	  int idx;
+	  /* We win if the charset inside the loop
+		 has no overlap with the one after the loop.  */
+	  for (idx = 0;
+		(idx < (int) p2[1]
+		 && idx < CHARSET_BITMAP_SIZE (p1));
+		idx++)
+	    if ((p2[2 + idx] & p1[2 + idx]) != 0)
+	      break;
+
+	  if (idx == p2[1]
+	      || idx == CHARSET_BITMAP_SIZE (p1))
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+      else if ((re_opcode_t) *p1 == charset_not)
+	{
+	  int idx;
+	  /* We win if the charset_not inside the loop lists
+		 every character listed in the charset after.  */
+	  for (idx = 0; idx < (int) p2[1]; idx++)
+	    if (! (p2[2 + idx] == 0
+		   || (idx < CHARSET_BITMAP_SIZE (p1)
+		       && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
+	      break;
+
+	  if (idx == p2[1])
+	    {
+	      DEBUG_PRINT ("	 No match => fast loop.\n");
+	      return true;
+	    }
+	}
+    }
+  return false;
+}
+
 /* True if "p1 matches something" implies "p2 fails".  */
+/* Avoiding inf-loops:
+   We're trying to follow all paths reachable from `p2`, but since some
+   loops can match the empty string, this can loop back to `p2`.
+   To avoid inf-looping, we keep track of points that have been considered
+   "already".  Instead of keeping a list of such points, we assume
+   that the code is a "sequential loop nest" and only keep track of
+   `done_beg` and `done_end` which delimit a chunk of bytecode we consider
+   as already considered.  */
 static bool
-mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
-		      re_char *p2)
+mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
+		        re_char *p2, re_char *done_beg, re_char *done_end)
 {
   re_opcode_t op2;
-  bool multibyte = RE_MULTIBYTE_P (bufp);
   unsigned char *pend = bufp->buffer + bufp->used;
   re_char *p2_orig = p2;
 
   eassert (p1 >= bufp->buffer && p1 < pend
 	   && p2 >= bufp->buffer && p2 <= pend);
 
+  eassert (done_beg <= done_end);
+  eassert (done_end <= p2);
+
   /* Skip over open/close-group commands.
      If what follows this loop is a ...+ construct,
      look at what begins its body, since we will have to
@@ -3684,98 +3793,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 
     case endline:
     case exactn:
-      {
-	int c
-	  = (re_opcode_t) *p2 == endline ? '\n'
-	  : RE_STRING_CHAR (p2 + 2, multibyte);
-
-	if ((re_opcode_t) *p1 == exactn)
-	  {
-	    if (c != RE_STRING_CHAR (p1 + 2, multibyte))
-	      {
-		DEBUG_PRINT ("  '%c' != '%c' => fast loop.\n", c, p1[2]);
-		return true;
-	      }
-	  }
-
-	else if ((re_opcode_t) *p1 == charset
-		 || (re_opcode_t) *p1 == charset_not)
-	  {
-	    if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
-                                  Qnil))
-	      {
-		DEBUG_PRINT ("	 No match => fast loop.\n");
-		return true;
-	      }
-	  }
-	else if ((re_opcode_t) *p1 == anychar
-		 && c == '\n')
-	  {
-	    DEBUG_PRINT ("   . != \\n => fast loop.\n");
-	    return true;
-	  }
-      }
-      break;
+      return mutually_exclusive_exactn (bufp, p1, p2);
 
     case charset:
       {
 	if ((re_opcode_t) *p1 == exactn)
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
-
-      /* It is hard to list up all the character in charset
-	 P2 if it includes multibyte character.  Give up in
-	 such case.  */
-      else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2))
-	{
-	  /* Now, we are sure that P2 has no range table.
-	     So, for the size of bitmap in P2, 'p2[1]' is
-	     enough.  But P1 may have range table, so the
-	     size of bitmap table of P1 is extracted by
-	     using macro 'CHARSET_BITMAP_SIZE'.
-
-	     In a multibyte case, we know that all the character
-	     listed in P2 is ASCII.  In a unibyte case, P1 has only a
-	     bitmap table.  So, in both cases, it is enough to test
-	     only the bitmap table of P1.  */
-
-	  if ((re_opcode_t) *p1 == charset)
-	    {
-	      int idx;
-	      /* We win if the charset inside the loop
-		 has no overlap with the one after the loop.  */
-	      for (idx = 0;
-		   (idx < (int) p2[1]
-		    && idx < CHARSET_BITMAP_SIZE (p1));
-		   idx++)
-		if ((p2[2 + idx] & p1[2 + idx]) != 0)
-		  break;
-
-	      if (idx == p2[1]
-		  || idx == CHARSET_BITMAP_SIZE (p1))
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	    }
-	  else if ((re_opcode_t) *p1 == charset_not)
-	    {
-	      int idx;
-	      /* We win if the charset_not inside the loop lists
-		 every character listed in the charset after.  */
-	      for (idx = 0; idx < (int) p2[1]; idx++)
-		if (! (p2[2 + idx] == 0
-		       || (idx < CHARSET_BITMAP_SIZE (p1)
-			   && ((p2[2 + idx] & ~ p1[2 + idx]) == 0))))
-		  break;
-
-	      if (idx == p2[1])
-		{
-		  DEBUG_PRINT ("	 No match => fast loop.\n");
-		  return true;
-		}
-	      }
-	  }
+	  return mutually_exclusive_exactn (bufp, p2, p1);
+	else
+	  return mutually_exclusive_charset (bufp, p1, p2);
       }
       break;
 
@@ -3783,9 +3808,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
       switch (*p1)
 	{
 	case exactn:
+	  return mutually_exclusive_exactn (bufp, p2, p1);
 	case charset:
-	  /* Reuse the code above.  */
-	  return mutually_exclusive_p (bufp, p2, p1);
+	  return mutually_exclusive_charset (bufp, p2, p1);
 	case charset_not:
 	  /* When we have two charset_not, it's very unlikely that
 	     they don't overlap.  The union of the two sets of excluded
@@ -3830,12 +3855,30 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
         int mcnt;
 	p2++;
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
-	/* Don't just test `mcnt > 0` because non-greedy loops have
-	   their test at the end with an unconditional jump at the start.  */
-	if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress.  */
-	  return (mutually_exclusive_p (bufp, p1, p2)
-		  && mutually_exclusive_p (bufp, p1, p2 + mcnt));
-	break;
+	re_char *p2_other = p2 + mcnt;
+
+	/* Presumably, any position in `done_beg..end` should be a position we
+	   already considered elsewhere (because our state machines are not
+	   arbitrary: they consists of syntaxically nested loops with limited
+	   control flow), so we can consider those back jumps as mutually
+	   exclusive here.
+	   This shows in the code below when `p3 > p2_orig`, i.e. when we jump
+	   forward: in that case we bump `done_end` up to `p3` under
+	   the assumption that any other reachable position
+	   between `done_end` and `p3` will be checked by the *other*
+	   call to RECURSE.
+	   The recursion should terminate because of a lexical ordering between
+	   `done_beg`, `done_end`, and `p2`:
+	   At each recursion, either `done_beg` gets smaller,
+	   or `done_beg` is unchanged and `done_end` gets larger
+	   or they're both unchanged and `p2` gets larger.  */
+#define RECURSE(p3)                                                \
+  ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
+   : (p3) <= done_end ? true                                       \
+   : mutually_exclusive_aux (bufp, p1, p3, done_beg,               \
+	                     (p3) > p2_orig ? done_end : (p3)))
+
+	return RECURSE (p2) && RECURSE (p2_other);
       }
 
     default:
@@ -3846,6 +3889,13 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
   return false;
 }
 
+static bool
+mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
+		      re_char *p2)
+{
+  re_char *done = min (p2, p1);
+  return mutually_exclusive_aux (bufp, p1, p2, done, done);
+}
 
 /* Matching routines.  */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Fri, 15 Sep 2023 20:05:02 GMT) Full text and rfc822 format available.

Message #94 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Fri, 15 Sep 2023 22:03:44 +0200

14 sep. 2023 kl. 16.41 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> I.e. keep the idea I proposed of keeping track of a beg..end region
> that's already been handled.  But now we really do have to maintain both
> ends (before, I only had `done` which kept track of the end of the
> region), and just "restart" when we jump back to a point before the
> "done" region.

Thank you, this looks workable and should enable some further optimisations. Another option would be to add the usual cycle detection, but this is probably faster.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Fri, 15 Sep 2023 22:21:01 GMT) Full text and rfc822 format available.

Message #97 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Fri, 15 Sep 2023 18:20:25 -0400

>> I.e. keep the idea I proposed of keeping track of a beg..end region
>> that's already been handled.  But now we really do have to maintain both
>> ends (before, I only had `done` which kept track of the end of the
>> region), and just "restart" when we jump back to a point before the
>> "done" region.
> Thank you, this looks workable and should enable some further
> optimisations.

I just pushed this to `master`.

> Another option would be to add the usual cycle detection,

That wouldn't be fun, tho, would it?

> but this is probably faster.

And more thrilling.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 16 Sep 2023 03:46:02 GMT) Full text and rfc822 format available.

Message #100 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Fri, 15 Sep 2023 23:45:33 -0400

>> so yes, we may need to remember where we've been. (At this point
>> someone will inevitably point out a helpful invariant that is obvious
>> in hindsight. This is just my cunning attempt at making that happen.)

I think I have a "helpful invariant", finally:

Because of how we build our code, when we're at an `on_failure_jump` the
two branches can either both go forward (typically for a "|" or a "?"),
or *one* of them goes backward the other forward (for loops), where the
one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it
`p2_loop`) that goes back to the beginning of the loop and the other
(call it `p2_exit`) is the one that exits the loop.

Now, because our loops are nested with proper "structured programming",
there can't be any jump from within the loop to outside the loop except
for the current jump.  And there can't be any jump from outside the loop
to inside the loop except by entering via `p2_loop`.

Since we have two recursive calls to `mutually_exclusive_p` (one for
`p2_exit` and one for `p2_loop`) and each one only needs to check those
positions not checked by the other, we can say that `p2_loop` only needs
to check the positions within the loop (i.e. between `p2_loop` and
`p2_exit`) and can presume that *all* other positions are checked by the
other recursive call (the one that starts at `p2_exit`).

So I think a single arg `done_end` (set, like the current `done_end`, to
`p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's
no way to go from `p2_loop` to before `p2_loop` without first going to
`p2_exit` (which is already checked by the other call).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 16 Sep 2023 10:51:02 GMT) Full text and rfc822 format available.

Message #103 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sat, 16 Sep 2023 12:49:58 +0200

16 sep. 2023 kl. 05.45 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

>>> (At this point
>>> someone will inevitably point out a helpful invariant that is obvious
>>> in hindsight. This is just my cunning attempt at making that happen.)
> 
> I think I have a "helpful invariant", finally:

Everything is proceeding as I have foreseen.

> Because of how we build our code, when we're at an `on_failure_jump` the
> two branches can either both go forward (typically for a "|" or a "?"),
> or *one* of them goes backward the other forward (for loops), where the
> one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it
> `p2_loop`) that goes back to the beginning of the loop and the other
> (call it `p2_exit`) is the one that exits the loop.
> 
> Now, because our loops are nested with proper "structured programming",
> there can't be any jump from within the loop to outside the loop except
> for the current jump.  And there can't be any jump from outside the loop
> to inside the loop except by entering via `p2_loop`.
> 
> Since we have two recursive calls to `mutually_exclusive_p` (one for
> `p2_exit` and one for `p2_loop`) and each one only needs to check those
> positions not checked by the other, we can say that `p2_loop` only needs
> to check the positions within the loop (i.e. between `p2_loop` and
> `p2_exit`) and can presume that *all* other positions are checked by the
> other recursive call (the one that starts at `p2_exit`).
> 
> So I think a single arg `done_end` (set, like the current `done_end`, to
> `p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's
> no way to go from `p2_loop` to before `p2_loop` without first going to
> `p2_exit` (which is already checked by the other call).

I think you are right, but wouldn't mind seeing it confirmed empirically. Say, by cross-checking using an an alternative (slower) implementation that directly checks whether a node has been visited before.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 16 Sep 2023 15:49:02 GMT) Full text and rfc822 format available.

Message #106 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sat, 16 Sep 2023 11:48:02 -0400

> I think you are right, but wouldn't mind seeing it confirmed
> empirically. Say, by cross-checking using an an alternative (slower)
> implementation that directly checks whether a node has been visited before.

I'm reworking my code so it takes 2 arguments: the loop-entry and the
loop-exit.  Any jump outside of those bounds should never happen so we
can assert it to check that our assumptions are right (and we can
return false when we don't compile assertions, so it's never over-optimistic).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 18 Sep 2023 02:15:02 GMT) Full text and rfc822 format available.

Message #109 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sun, 17 Sep 2023 22:14:37 -0400

[Message part 1 (text/plain, inline)]

> I'm reworking my code so it takes 2 arguments: the loop-entry and the
> loop-exit.  Any jump outside of those bounds should never happen so we
> can assert it to check that our assumptions are right (and we can
> return false when we don't compile assertions, so it's never over-optimistic).

Hmm... not quite working yet.  With the patch below I get:

    % make test/src/regex-emacs-tests
    [...]
    passed  30/34  regexp-multibyte-unibyte (0.000264 sec)
    0:      /on_failure_jump_smart to 9
    3:      /exactn/1/x
    6:      /jump to 0
    9:      /start_memory/1
    11:     /on_failure_jump to 26
    14:     /on_failure_jump_smart to 23
    17:     /exactn/1/=
    20:     /jump to 14
    23:     /jump to 29
    26:     /exactn/1/h
    29:     /stop_memory/1
    31:     /on_failure_jump_loop to 37
    34:     /jump to 9
    37:     /succeed
    38:     end of pattern.
    38 bytes used/128 bytes allocated.
    re_nsub: 1      regs_alloc: 1   can_be_null: 0
    p1==3, p2==34, loop-entry==14, loop-exit==26
    Test regexp-tests-backtrack-optimization backtrace:
      looking-at("x*\\(=*\\|h\\)+")
    [...]

The problem is that my code sees the `jump 9` (near the end) followed by
`start_memory` and `on_failure_jump to 26` as a a backward jump that
defines a loop whose body starts at 14 and whose exit point is at 26,
but that 14..26 is not a loop, it's a `|` and those don't obey the
assumption I made about the exit point (the 2 destinations of such an
`on_failure_jump` correspond to the first and the second patterns of the
`|`, i.e. entry&middle rather than entry&exit, so there *can* be
jumps straight out from the first point without going through the second
point 🙁).

BTW, here's an indented version of the code:

    0:      /on_failure_jump_smart to 9 {
    3:        /exactn/1/x
    6:      } /jump to 0
    9:      { /start_memory/1
    11:       /on_failure_jump to 26
    14:       /on_failure_jump_smart to 23 {
    17:         /exactn/1/=
    20:       } /jump to 14
    23:       /jump to 29
    26:       /exactn/1/h
    29:       /stop_memory/1
    31:       /on_failure_jump_loop to 37
    34:     } /jump to 9
    37:     /succeed
    38:     end of pattern.

Another problem we see here is that the main (second) loop spans 9..37
and is controlled by the `/on_failure_jump_loop to 37` but the two
destination of the OFJ are not 9 and 37 but 34 and 37 🙂.

So maybe we should `skip_noops` before looking at the destinations to
decide if it's a loop?


        Stefan

[regexp.c.patch (text/x-diff, inline)]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 55d0a6e8df8..c54f9f81890 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3068,8 +3068,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
 	  continue;
 
 	case succeed_n:
-	  /* If N == 0, it should be an on_failure_jump_loop instead.  */
-	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
+	  /* If N == 0, it should be an on_failure_jump_loop instead.
+	     `j` can be negative because `EXTRACT_NUMBER` extracts a
+	     signed number whereas `succeed_n` treats it as unsigned.  */
+	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));
 	  p += 4;
 	  /* We only care about one iteration of the loop, so we don't
 	     need to consider the case where this behaves like an
@@ -3743,20 +3745,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
 }
 
 /* True if "p1 matches something" implies "p2 fails".  */
-/* Avoiding inf-loops:
-   We're trying to follow all paths reachable from `p2`, but since some
+/* We're trying to follow all paths reachable from `p2`, but since some
    loops can match the empty string, this can loop back to `p2`.
-   To avoid inf-looping, we keep track of points that have been considered
-   "already".  Instead of keeping a list of such points, `done_beg` and
-   `done_end` delimit a chunk of bytecode we already considered.
-   To guarantee termination, a lexical ordering between `done_*` and `p2`
-   should be obeyed:
-       At each recursion, either `done_beg` gets smaller,
-       or `done_beg` is unchanged and `done_end` gets larger
-       or they're both unchanged and `p2` gets larger.  */
+   To avoid inf-looping, we take advantage of the fact that
+   the bytecode we generate is made of syntactically nested loops, more
+   specifically, every loop has a single entry point and single exit point.
+   The function takes 2 more arguments (`loop_entry` and `loop_exit`).
+   `loop_entry` points to the sole entry point of the current loop and
+   `loop_exit` points to its sole exit point (when non-NULL).
+   The function can assume that `loop_exit` is "mutually exclusive".
+   Termination of recursive calls is ensured because either `loop_entry`
+   is larger, or it's equal but `p2` is larger.  */
 static bool
 mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
-		        re_char *p2, re_char *done_beg, re_char *done_end)
+		        re_char *p2, re_char *loop_entry, re_char *loop_exit)
 {
   re_opcode_t op2;
   unsigned char *pend = bufp->buffer + bufp->used;
@@ -3765,8 +3767,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
   eassert (p1 >= bufp->buffer && p1 < pend
 	   && p2 >= bufp->buffer && p2 <= pend);
 
-  eassert (done_beg <= done_end);
-  eassert (done_end <= p2);
+  if (p2 == loop_exit)
+    /* Presumably already checked elsewhere.  */
+    return true;
+  eassert (loop_entry && p2 >= loop_entry);
+  /* eassert (!loop_exit  || p2 < loop_exit); */
+  if (p2 < loop_entry)
+    return false;
+  if (loop_exit && p2 > loop_exit)
+    {
+      void print_compiled_pattern (struct re_pattern_buffer *bufp);
+      print_compiled_pattern (bufp);
+      fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
+               p1 - bufp->buffer, p2 - bufp->buffer,
+               loop_entry - bufp->buffer, loop_exit - bufp->buffer);
+      error ("WOW1!");
+    }
+    /* return false; */
 
   /* Skip over open/close-group commands.
      If what follows this loop is a ...+ construct,
@@ -3860,27 +3877,30 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
 	re_char *p2_other = p2 + mcnt;
 
-	/* When we jump backward we bump `done_end` up to `p3` under
-	   the assumption that any other position between `done_end`
-	   and `p3` is either:
-           - checked by the other call to RECURSE.
-           - not reachable from here (e.g. for positions before the
-             `on_failure_jump`), or at least not without first
-             jumping before `done_beg`.
-           This should hold because our state machines are not arbitrary:
-           they consists of syntaxically nested loops with limited
-	   control flow.
-	   FIXME: This can fail (i.e. return true when it shouldn't)
-	   if we start generating bytecode with a different shape,
-	   so maybe we should bite the bullet and replace done_beg/end
-	   with an actual list of positions we've already processed.  */
-#define RECURSE(p3)                                                \
-  ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
-   : (p3) <= done_end ? true                                       \
-   : mutually_exclusive_aux (bufp, p1, p3, done_beg,               \
-	                     (p3) > p2_orig ? done_end : (p3)))
-
-	return RECURSE (p2) && RECURSE (p2_other);
+	/* If we jump backward to the entry point of the current loop
+	   it means it's a zero-length cycle through that loop, so
+	   this cycle itself does not break mutual-exclusion.
+	   If we jump backward to a new loop nested within the current one
+	   then the `p2_other` points to the exit of that inner loop
+	   and the other RECURSE will check what happens when we leave
+	   the loop so we can focus on checking just the loop body,
+	   so we pass new values for `loop-entry` and `loop_exit`.
+	   In all other cases, we just recurse "normally": if it's before
+	   `loop_entry` it's a bug that will be caught by checks at
+	   the entrace of the function and otherwise it's a forward jump
+	   which doesn't need any special treatment.
+	   FIXME: This is failsafe (can't return true when it shouldn't)
+	   but it could be too conservative if we start generating bytecode
+	   with a different shape, so maybe we should bite the bullet and
+	   replace done_beg/end with an actual list of positions we've
+	   already processed.  */
+#define RECURSE(p3, p3_other)                            \
+  ((p3) == loop_entry ? true                             \
+   : loop_entry < (p3) && (p3) < p2_orig ?               \
+     mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \
+   : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit))
+        
+	return RECURSE (p2, p2_other) && RECURSE (p2_other, p2);
       }
 
     default:
@@ -3895,7 +3915,7 @@ #define RECURSE(p3)                                                \
 mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 		      re_char *p2)
 {
-  return mutually_exclusive_aux (bufp, p1, p2, p2, p2);
+  return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL);
 }
 
 /* Matching routines.  */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 18 Sep 2023 04:00:03 GMT) Full text and rfc822 format available.

Message #112 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sun, 17 Sep 2023 23:59:32 -0400

> So maybe we should `skip_noops` before looking at the destinations to
> decide if it's a loop?

Nah, just introduces other problems :-(

BTW maybe we should replace

    31:     /on_failure_jump_loop to 37
    34:     /jump to 9

with

    31:     /on_failure_dont_jump_loop to 9

It's quite funny actually: with the current bytecodes, every iteration
of a backtracking greedy loop runs `jump + on_failure_jump`, whereas for
a non-greedy loop every iteration runs just a single `on_failure_jump`.
Yet those bytecodes predate the introduction of non-greedy loops :-)

[ Of course, there are reasons for it, beside historical ones: it's harder
  to rewrite a loop that ends with `on_failure_dont_jump_smart` into one
  using `on_failure_keep_string_jump` :-(  ]


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Mon, 18 Sep 2023 12:34:01 GMT) Full text and rfc822 format available.

Message #115 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Mon, 18 Sep 2023 08:32:40 -0400

[Message part 1 (text/plain, inline)]

> BTW maybe we should replace
>
>     31:     /on_failure_jump_loop to 37
>     34:     /jump to 9
>
> with
>
>     31:     /on_failure_dont_jump_loop to 9

In the mean time the patch below recognizes the above pattern, which
gives me code which is safe (it should both always terminate and should
never apply the optimization if it's not valid) yet does find the
optimization in all the cases where I expected it.


        Stefan

[regexp.c.patch (text/x-diff, inline)]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 55d0a6e8df8..e45d0f19dbf 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1923,12 +1923,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 		    if (!zero_times_ok && simple)
 		      { /* Since simple * loops can be made faster by using
-			   on_failure_keep_string_jump, we turn simple P+
-			   into PP* if P is simple.  */
+			   on_failure_keep_string_jump, we turn P+ into PP*
+                           if P is simple.
+			   We can't use `top: <BODY>; OFJS exit; J top; exit:`
+			   Because the OFJS needs to be at the beginning
+			   since we want to be able to replace
+			   `top: OFJS exit; <BODY>; J top; exit`
+			   with
+			   `top: OFKSJ exit; loop: <BODY>; J loop; exit`,
+			   i.e. a single OFJ at the beginning of the loop
+			   rather than once per iteration.  */
 			unsigned char *p1, *p2;
 			startoffset = b - laststart;
 			GET_BUFFER_SPACE (startoffset);
 			p1 = b; p2 = laststart;
+			/* This presumes that the code skipped
+			   by `skip_one_char` is position-independent.  */
 			while (p2 < p1)
 			  *b++ = *p2++;
 			zero_times_ok = 1;
@@ -3068,8 +3078,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
 	  continue;
 
 	case succeed_n:
-	  /* If N == 0, it should be an on_failure_jump_loop instead.  */
-	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
+	  /* If N == 0, it should be an on_failure_jump_loop instead.
+	     `j` can be negative because `EXTRACT_NUMBER` extracts a
+	     signed number whereas `succeed_n` treats it as unsigned.  */
+	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));
 	  p += 4;
 	  /* We only care about one iteration of the loop, so we don't
 	     need to consider the case where this behaves like an
@@ -3743,20 +3755,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1,
 }
 
 /* True if "p1 matches something" implies "p2 fails".  */
-/* Avoiding inf-loops:
-   We're trying to follow all paths reachable from `p2`, but since some
+/* We're trying to follow all paths reachable from `p2`, but since some
    loops can match the empty string, this can loop back to `p2`.
-   To avoid inf-looping, we keep track of points that have been considered
-   "already".  Instead of keeping a list of such points, `done_beg` and
-   `done_end` delimit a chunk of bytecode we already considered.
-   To guarantee termination, a lexical ordering between `done_*` and `p2`
-   should be obeyed:
-       At each recursion, either `done_beg` gets smaller,
-       or `done_beg` is unchanged and `done_end` gets larger
-       or they're both unchanged and `p2` gets larger.  */
+   To avoid inf-looping, we take advantage of the fact that
+   the bytecode we generate is made of syntactically nested loops, more
+   specifically, every loop has a single entry point and single exit point.
+   The function takes 2 more arguments (`loop_entry` and `loop_exit`).
+   `loop_entry` points to the sole entry point of the current loop and
+   `loop_exit` points to its sole exit point (when non-NULL).
+   The function can assume that `loop_exit` is "mutually exclusive".
+   Termination of recursive calls is ensured because either `loop_entry`
+   is larger, or it's equal but `p2` is larger.  */
 static bool
 mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
-		        re_char *p2, re_char *done_beg, re_char *done_end)
+		        re_char *p2, re_char *loop_entry, re_char *loop_exit)
 {
   re_opcode_t op2;
   unsigned char *pend = bufp->buffer + bufp->used;
@@ -3765,8 +3777,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
   eassert (p1 >= bufp->buffer && p1 < pend
 	   && p2 >= bufp->buffer && p2 <= pend);
 
-  eassert (done_beg <= done_end);
-  eassert (done_end <= p2);
+  if (p2 == loop_exit)
+    /* Presumably already checked elsewhere.  */
+    return true;
+  eassert (loop_entry && p2 >= loop_entry);
+  /* eassert (!loop_exit  || p2 < loop_exit); */
+  if (p2 < loop_entry)
+    return false;
+  if (loop_exit && p2 > loop_exit)
+    {
+      void print_compiled_pattern (struct re_pattern_buffer *bufp);
+      print_compiled_pattern (bufp);
+      fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
+               p1 - bufp->buffer, p2 - bufp->buffer,
+               loop_entry - bufp->buffer, loop_exit - bufp->buffer);
+      error ("WOW1!");
+    }
+    /* return false; */
 
   /* Skip over open/close-group commands.
      If what follows this loop is a ...+ construct,
@@ -3859,28 +3886,40 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1,
 	p2++;
 	EXTRACT_NUMBER_AND_INCR (mcnt, p2);
 	re_char *p2_other = p2 + mcnt;
+	/* For `+` loops, we often have an `on_failure_jump` that skips forward
+	   over a subsequent `jump` for lack of an `on_failure_dont_jump`
+	   kind of thing.  Recognize this pattern since that subsequent
+	   `jump` is the one that jumps to the loop-entry.  */
+	if ((re_opcode_t) p2[0] == jump && mcnt == 3)
+	  {
+	    EXTRACT_NUMBER (mcnt, p2 + 1);
+	    p2 += mcnt + 3;
+	  }
 
-	/* When we jump backward we bump `done_end` up to `p3` under
-	   the assumption that any other position between `done_end`
-	   and `p3` is either:
-           - checked by the other call to RECURSE.
-           - not reachable from here (e.g. for positions before the
-             `on_failure_jump`), or at least not without first
-             jumping before `done_beg`.
-           This should hold because our state machines are not arbitrary:
-           they consists of syntaxically nested loops with limited
-	   control flow.
-	   FIXME: This can fail (i.e. return true when it shouldn't)
-	   if we start generating bytecode with a different shape,
-	   so maybe we should bite the bullet and replace done_beg/end
-	   with an actual list of positions we've already processed.  */
-#define RECURSE(p3)                                                \
-  ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \
-   : (p3) <= done_end ? true                                       \
-   : mutually_exclusive_aux (bufp, p1, p3, done_beg,               \
-	                     (p3) > p2_orig ? done_end : (p3)))
-
-	return RECURSE (p2) && RECURSE (p2_other);
+	/* If we jump backward to the entry point of the current loop
+	   it means it's a zero-length cycle through that loop, so
+	   this cycle itself does not break mutual-exclusion.
+	   If we jump backward to a new loop nested within the current one
+	   then the `p2_other` points to the exit of that inner loop
+	   and the other RECURSE will check what happens when we leave
+	   the loop so we can focus on checking just the loop body,
+	   so we pass new values for `loop-entry` and `loop_exit`.
+	   In all other cases, we just recurse "normally": if it's before
+	   `loop_entry` it's a bug that will be caught by checks at
+	   the entrace of the function and otherwise it's a forward jump
+	   which doesn't need any special treatment.
+	   FIXME: This is failsafe (can't return true when it shouldn't)
+	   but it could be too conservative if we start generating bytecode
+	   with a different shape, so maybe we should bite the bullet and
+	   replace done_beg/end with an actual list of positions we've
+	   already processed.  */
+#define RECURSE(p3, p3_other)                                      \
+  ((p3) == loop_entry ? true                                       \
+   : loop_entry < (p3) && (p3) < p2_orig && (p3_other) > p2_orig ? \
+     mutually_exclusive_aux (bufp, p1, p3, p3, p3_other)           \
+   : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit))
+
+	return RECURSE (p2, p2_other) && RECURSE (p2_other, p2);
       }
 
     default:
@@ -3895,7 +3934,7 @@ #define RECURSE(p3)                                                \
 mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 		      re_char *p2)
 {
-  return mutually_exclusive_aux (bufp, p1, p2, p2, p2);
+  return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL);
 }
 
 /* Matching routines.  */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Thu, 21 Sep 2023 17:25:01 GMT) Full text and rfc822 format available.

Message #118 received at 65726 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726 <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Thu, 21 Sep 2023 19:23:45 +0200

18 sep. 2023 kl. 14.32 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> In the mean time the patch below recognizes the above pattern, which
> gives me code which is safe (it should both always terminate and should
> never apply the optimization if it's not valid) yet does find the
> optimization in all the cases where I expected it.

Yes, maybe this would work but the ad-hocness of it all means that I very much could have missed something (and as you comment, that it's brittle).

> 	case succeed_n:
> -	  /* If N == 0, it should be an on_failure_jump_loop instead.  */
> -	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0));
> +	  /* If N == 0, it should be an on_failure_jump_loop instead.
> +	     `j` can be negative because `EXTRACT_NUMBER` extracts a
> +	     signed number whereas `succeed_n` treats it as unsigned.  */
> +	  DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0));

So this is what prevented me from running with counts above 2**15 and debugging enabled for all these years...

> +  if (loop_exit && p2 > loop_exit)
> +    {
> +      void print_compiled_pattern (struct re_pattern_buffer *bufp);
> +      print_compiled_pattern (bufp);
> +      fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n",
> +               p1 - bufp->buffer, p2 - bufp->buffer,
> +               loop_entry - bufp->buffer, loop_exit - bufp->buffer);
> +      error ("WOW1!");

#ifdef REGEX_EMACS_DEBUG ?

> +    }
> +    /* return false; */

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Thu, 21 Sep 2023 18:10:01 GMT) Full text and rfc822 format available.

Message #121 received at 65726-done <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726-done <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Thu, 21 Sep 2023 14:08:58 -0400

> So this is what prevented me from running with counts above 2**15 and
> debugging enabled for all these years...

Looks like it.

In any case, I pushed a variant of the last patch to `master` and
I think the case is now closed.  We may want to revisit it, presumably
by really keeping track of the all visited nodes, but this seems to work
well for now and I'm confident that it is safe (regardless of the shape
of the bytecode, it should always terminate and it should never apply
the optimization if it's incorrect, and with the shape of the bytecode
we currently produce it does correctly apply the optimization in the cases
I checked).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#65726; Package emacs. (Sat, 23 Sep 2023 11:58:02 GMT) Full text and rfc822 format available.

Message #124 received at 65726-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, Eli Zaretskii <eliz <at> gnu.org>,
 65726-done <at> debbugs.gnu.org
Subject: Re: bug#65726: 29.1.50; Crash in regexp engine
Date: Sat, 23 Sep 2023 13:56:51 +0200

21 sep. 2023 kl. 20.08 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> In any case, I pushed a variant of the last patch to `master` and
> I think the case is now closed.  We may want to revisit it, presumably
> by really keeping track of the all visited nodes, but this seems to work
> well for now and I'm confident that it is safe (regardless of the shape
> of the bytecode, it should always terminate and it should never apply
> the optimization if it's incorrect, and with the shape of the bytecode
> we currently produce it does correctly apply the optimization in the cases
> I checked).

Good, thank you.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 22 Oct 2023 11:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 201 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #65726 29.1.50; Crash in regexp engine

GNU bug report logs - #65726
29.1.50; Crash in regexp engine