GNU bug report logs - #77392
‘regexp-exec’ gets match boundaries wrong for multibyte strings

Package: guile

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Sun, 30 Mar 2025 20:55:02 UTC

Severity: normal

To reply to this bug, email your comments to 77392 AT debbugs.gnu.org.


Report forwarded to bug-guile <at> gnu.org:
bug#77392; Package guile. (Sun, 30 Mar 2025 20:55:02 GMT)

Acknowledgement sent to Ludovic Courtès <ludo <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sun, 30 Mar 2025 20:55:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: bug-guile <at> gnu.org
Subject: ‘regexp-exec’ gets match boundaries wrong for multibyte strings
Date: Sun, 30 Mar 2025 22:54:24 +0200
‘regexp-exec’ sometimes gets match boundaries wrong when it operates on a
string containing non-ASCII characters while the C locale is in effect
(this is with commit af96820e072d18c49ac03e80c6f3466d568dc77d):

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (setlocale LC_ALL "C")
$52 = "C"
scheme@(guile-user)> (string-match "start (.*)"
                                   (string-append "start "
                                                  (string (integer->char 1002))))
$53 = #("start \u03ea" (0 . 8) (6 . 8))
scheme@(guile-user)> (match:substring $53 1)
ice-9/boot-9.scm:1683:22: In procedure raise-exception:
Value out of range 6 to< 7: 8

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
--8<---------------cut here---------------end--------------->8---

The attached program tries randomly chosen code points and triggers more
such failures.  (The example above works correctly under a UTF-8 locale.)

So I believe ‘fixup_multibyte_match’ isn’t quite correct.
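
The offsets in the session above are at least consistent with that
suspicion.  As a minimal sketch, assuming the positions in question are
byte positions in a UTF-8 encoding of the subject string (an assumption,
not verified against the C code), the snippet below compares the
character length of the same string with its UTF-8 byte length.  It
relies only on ‘string->utf8’ and ‘bytevector-length’ from
(rnrs bytevectors); the names ‘str’, ‘char-count’ and ‘byte-count’ are
illustrative.

```scheme
(use-modules (rnrs bytevectors))

;; Same string as in the session above: "start " followed by U+03EA.
(define str
  (string-append "start " (string (integer->char 1002))))

;; Length in characters: the largest end offset a substring can have.
(define char-count (string-length str))

;; Length in bytes once encoded as UTF-8 (U+03EA takes two bytes).
;; Assumption: UTF-8 is the encoding that matters here.
(define byte-count (bytevector-length (string->utf8 str)))

(format #t "characters: ~a, UTF-8 bytes: ~a~%" char-count byte-count)
;; prints: characters: 7, UTF-8 bytes: 8
```

The end offset 8 reported in $53 equals the byte count, whereas
‘match:substring’ can only accept offsets up to the character count, 7,
which would explain the ‘Value out of range’ error.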

Ludo’.

PS: This originates in <https://issues.guix.gnu.org/77283>.

[regexp-unicode-ascii.scm (text/plain, inline)]
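;; Reproducer: in the C locale, run 'regexp-exec' on strings that end in a
;; randomly chosen character beyond Latin-1 until 'match:substring' fails.
(use-modules (ice-9 regex))

(define rx
  (make-regexp "^start (.*)"))

(setlocale LC_ALL "C")
(let loop ()
  ;; Pick a code point in the range 256..1279.
  (let* ((i (+ 256 (random (expt 2 10))))
         (str (string-append "start " (string (integer->char i)))))
    (with-exception-handler
        (lambda (exc)
          ;; Print the exception and the offending code point, then stop.
          (pk 'exc exc '<-- i)
          (display-backtrace (make-stack #t) (current-error-port))
          (exit 1))
      (lambda ()
        ;; Fails with an out-of-range error when the match offsets are wrong.
        (match:substring (regexp-exec rx str) 1)))
    (loop)))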

This bug report was last modified 5 days ago.

