GNU bug report logs - #30116
[PATCH] `substitute' crashes when file contains NUL characters (core-updates)

Previous Next

Package: guix;

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Mon, 15 Jan 2018 01:29:02 UTC

Severity: normal

Tags: patch

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30116 in the body.
You can then email your comments to 30116 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Mon, 15 Jan 2018 01:29:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Mon, 15 Jan 2018 01:29:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: bug-guix <bug-guix <at> gnu.org>
Subject: [PATCH] `substitute' crashes when file contains NUL characters
 (core-updates)
Date: Sun, 14 Jan 2018 20:27:57 -0500
Hello,

I've encountered the following crash when trying to use substitute on a
file which contains NUL characters:

--8<---------------cut here---------------start------------->8---
(define problematic-file "/tmp/bp-image-data.el")
scheme@(guix build utils)> ,m (guix build utils)
scheme@(guix build utils)> (substitute* problematic-file
			     (("toto") "tata"))
ice-9/boot-9.scm:752:25: In procedure dispatch-exception:
string contains #\nul character: "\"II*\x00(\x03\x00\x00����������@@@@����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����BBBB��������������������@@@@����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����BBBB������������\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����BBBB����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x10\x10\x10\x10����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x10\x10\x10\x10����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x10\x10\x10\x10����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����>>>>����<<<<����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����>>>>��������������������<<<<����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x01����\x01\x01\x01\x01����\x04\x04\x04\x04����>>>>������������������������������������<<<<����\x0f\x0f\x0f\x0f����\x0f\x0f\x0f\x0f����\x0f\x0f\x0f\x0f����>>>>��������������������������\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n"

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guix build utils) [1]> ,bt
In ice-9/boot-9.scm:
    841:4  9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
   444:17  8 (call-with-input-file _ _ #:binary _ #:encoding _ #:guess-encoding _)
In guix/build/utils.scm:
   609:26  7 (_ _)
   635:26  6 (_ #<input: /tmp/bp-image-data.el 14> #<input-output: /tmp/bp-image-data.el.qVytzo 13>)
In srfi/srfi-1.scm:
   466:18  5 (fold #<procedure 7f29b8929520 at guix/build/utils.scm:635:32 (r+p line)> "\"II*\x00(\x03\x00\x00���…" …)
In guix/build/utils.scm:
   638:37  4 (_ _ "\"II*\x00(\x03\x00\x00����������@@@@����\x04\x04\x04\x04����\x01\x01\x01\x01����\x01\x01\x01\x0…")
In ice-9/regex.scm:
   189:12  3 (list-matches _ _ _)
   177:19  2 (fold-matches _ "\"II*\x00(\x03\x00\x00����������@@@@����\x04\x04\x04\x04����\x01\x01\x01\x01����\x0…" …)
In unknown file:
           1 (regexp-exec #<regexp 51f3bc0> "\"II*\x00(\x03\x00\x00����������@@@@����\x04\x04\x04\x04����\x01\x01…" …)
In ice-9/boot-9.scm:
   752:25  0 (dispatch-exception _ _ _)
--8<---------------cut here---------------end--------------->8---

That file comes from emacs-realgud, which I'm attempting to package:
https://github.com/realgud/realgud/blob/master/realgud/common/bp-image-data.el.

This was discovered when the patch-el-files phase of the
emacs-build-system crashed as above when it called substitute*.

Patch to follow.

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Mon, 15 Jan 2018 01:39:02 GMT) Full text and rfc822 format available.

Message #8 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: 30116 <at> debbugs.gnu.org
Subject: [PATCH] `substitute' crashes when file contains NUL characters
 (core-updates))
Date: Sun, 14 Jan 2018 20:38:22 -0500
[0001-utils-Prevent-substitute-from-crashing-on-files-cont.patch (text/x-patch, attachment)]
From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Date: Sun, 14 Jan 2018 20:31:33 -0500
Subject: [PATCH] utils: Prevent substitute from crashing on files containing
 NUL chars.

Fixes issue #30116.

* guix/build/utils.scm (substitute): Add condition to skip lines containing
the NUL character.
---
 guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/guix/build/utils.scm b/guix/build/utils.scm
index 7391307c8..975f4e70a 100644
--- a/guix/build/utils.scm
+++ b/guix/build/utils.scm
@@ -3,6 +3,7 @@
 ;;; Copyright © 2013 Andreas Enge <andreas <at> enge.fr>
 ;;; Copyright © 2013 Nikita Karetnikov <nikita <at> karetnikov.org>
 ;;; Copyright © 2015 Mark H Weaver <mhw <at> netris.org>
+;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
 a substitution of the original line.  Be careful about using '$' to match the
 end of a line; by itself it won't match the terminating newline of a line."
   (let ((rx+proc  (map (match-lambda
-                        (((? regexp? pattern) . proc)
-                         (cons pattern proc))
-                        ((pattern . proc)
-                         (cons (make-regexp pattern regexp/extended)
-                               proc)))
+                         (((? regexp? pattern) . proc)
+                          (cons pattern proc))
+                         ((pattern . proc)
+                          (cons (make-regexp pattern regexp/extended)
+                                proc)))
                        pattern+procs)))
     (with-atomic-file-replacement file
       (lambda (in out)
         (let loop ((line (read-line in 'concat)))
-          (if (eof-object? line)
-              #t
-              (let ((line (fold (lambda (r+p line)
-                                  (match r+p
-                                    ((regexp . proc)
-                                     (match (list-matches regexp line)
-                                       ((and m+ (_ _ ...))
-                                        (proc line m+))
-                                       (_ line)))))
-                                line
-                                rx+proc)))
-                (display line out)
-                (loop (read-line in 'concat)))))))))
+          (cond
+           ((eof-object? line)
+            #t)
+           ((string-contains line (make-string 1 #\nul))
+            ;; The regexp functions of the GNU C library (which Guile uses)
+            ;; cannot deal with NUL characters, so skip to the next line.
+            (format #t "skipping line with NUL characters: ~s\n" line)
+            (loop (read-line in 'concat)))
+           (else
+            (let ((line (fold (lambda (r+p line)
+                                (match r+p
+                                  ((regexp . proc)
+                                   (match (list-matches regexp line)
+                                     ((and m+ (_ _ ...))
+                                      (proc line m+))
+                                     (_ line)))))
+                              line
+                              rx+proc)))
+              (display line out)
+              (loop (read-line in 'concat))))))))))
 
 
 (define-syntax let-matches
-- 
2.15.1





Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Tue, 16 Jan 2018 11:24:02 GMT) Full text and rfc822 format available.

Message #11 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Tue, 16 Jan 2018 12:23:17 +0100
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> I've encountered the following crash when trying to use substitute on a
> file which contains NUL characters:

Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
which does not handle NULs.

We should consider switching to the pure-Scheme SRFI-115:

  https://srfi.schemers.org/srfi-115/srfi-115.html

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Wed, 17 Jan 2018 14:38:01 GMT) Full text and rfc822 format available.

Message #14 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Wed, 17 Jan 2018 15:37:54 +0100
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
> Date: Sun, 14 Jan 2018 20:31:33 -0500
> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>  NUL chars.
>
> Fixes issue #30116.
>
> * guix/build/utils.scm (substitute): Add condition to skip lines containing
> the NUL character.

[...]

> +           ((string-contains line (make-string 1 #\nul))

Rather (string-index line #\nul).

> +            ;; The regexp functions of the GNU C library (which Guile uses)
> +            ;; cannot deal with NUL characters, so skip to the next line.
> +            (format #t "skipping line with NUL characters: ~s\n" line)
> +            (loop (read-line in 'concat)))

Rather (format (current-error-port) …).

It’s strange semantics, but it’s probably better than crashing in the
contexts where we use it.

Otherwise LGTM.  This would have to go to the next ‘core-updates’ (or
‘core-updates-next’ in the meantime.)

Thanks!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Sun, 21 Jan 2018 04:25:02 GMT) Full text and rfc822 format available.

Message #17 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Sat, 20 Jan 2018 23:24:34 -0500
ludo <at> gnu.org (Ludovic Courtès) writes:

> Hi,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> I've encountered the following crash when trying to use substitute on a
>> file which contains NUL characters:
>
> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
> which does not handle NULs.
>
> We should consider switching to the pure-Scheme SRFI-115:
>
>   https://srfi.schemers.org/srfi-115/srfi-115.html

This looks good, and I started looking into porting `substitute' to it,
but quickly noticed it doesn't seem to be implemented in Guile yet?

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Sun, 21 Jan 2018 18:19:02 GMT) Full text and rfc822 format available.

Message #20 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Sun, 21 Jan 2018 13:17:45 -0500
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> ludo <at> gnu.org (Ludovic Courtès) writes:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>>
>>> I've encountered the following crash when trying to use substitute on a
>>> file which contains NUL characters:
>>
>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>> which does not handle NULs.
>>
>> We should consider switching to the pure-Scheme SRFI-115:
>>
>>   https://srfi.schemers.org/srfi-115/srfi-115.html
>
> This looks good, and I started looking into porting `substitute' to it,
> but quickly noticed it doesn't seem to be implemented in Guile yet?

Indeed.  SRFI-115 for Guile is on my TODO list, although it might be
better to wait until after we switch to using UTF-8 encoding internally
for strings, since that will drastically affect the implementation of
any efficient regexp matcher on Scheme strings.

Anyway, 'substitute*' is to be used only on text files, and NUL bytes
are not a valid textual character.  So, I think that this case is
outside of what 'substitute*' is meant to do, and therefore not a bug in
'substitute*', although of course a more graceful error would surely be
preferable.

What do you think?

      Mark




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Mon, 22 Jan 2018 10:59:02 GMT) Full text and rfc822 format available.

Message #23 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 30116 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Mon, 22 Jan 2018 11:58:37 +0100
Mark H Weaver <mhw <at> netris.org> skribis:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> ludo <at> gnu.org (Ludovic Courtès) writes:
>>
>>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>>>
>>>> I've encountered the following crash when trying to use substitute on a
>>>> file which contains NUL characters:
>>>
>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>> which does not handle NULs.
>>>
>>> We should consider switching to the pure-Scheme SRFI-115:
>>>
>>>   https://srfi.schemers.org/srfi-115/srfi-115.html
>>
>> This looks good, and I started looking into porting `substitute' to it,
>> but quickly noticed it doesn't seem to be implemented in Guile yet?

ISTR that the reference implementation works fine on Guile.

> Indeed.  SRFI-115 for Guile is on my TODO list, although it might be
> better to wait until after we switch to using UTF-8 encoding internally
> for strings, since that will drastically affect the implementation of
> any efficient regexp matcher on Scheme strings.

Indeed, though I suppose it doesn’t matter much for the cases where
‘substitute*’ is used?

> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
> are not a valid textual character.  So, I think that this case is
> outside of what 'substitute*' is meant to do, and therefore not a bug in
> 'substitute*', although of course a more graceful error would surely be
> preferable.

Yes, that’s also a good point.

So yeah, I think it may be good “eventually” to switch to SRFI-115, but
that’s not urgent.

Thoughts?

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Tue, 23 Jan 2018 04:28:02 GMT) Full text and rfc822 format available.

Message #26 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Mark H Weaver <mhw <at> netris.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Mon, 22 Jan 2018 23:27:04 -0500
[Message part 1 (text/plain, inline)]
ludo <at> gnu.org (Ludovic Courtès) writes:

> Mark H Weaver <mhw <at> netris.org> skribis:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>>
>>> ludo <at> gnu.org (Ludovic Courtès) writes:
>>>
>>>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>>>>
>>>>> I've encountered the following crash when trying to use substitute on a
>>>>> file which contains NUL characters:
>>>>
>>>> Yes, that’s because Guile’s ‘regexp-exec’ simply wraps libc’s ‘regexec’,
>>>> which does not handle NULs.
>>>>
>>>> We should consider switching to the pure-Scheme SRFI-115:
>>>>
>>>>   https://srfi.schemers.org/srfi-115/srfi-115.html
>>>
>>> This looks good, and I started looking into porting `substitute' to it,
>>> but quickly noticed it doesn't seem to be implemented in Guile yet?
>
> ISTR that the reference implementation works fine on Guile.
>
>> Indeed.  SRFI-115 for Guile is on my TODO list, although it might be
>> better to wait until after we switch to using UTF-8 encoding internally
>> for strings, since that will drastically affect the implementation of
>> any efficient regexp matcher on Scheme strings.
>
> Indeed, though I suppose it doesn’t matter much for the cases where
> ‘substitute*’ is used?
>
>> Anyway, 'substitute*' is to be used only on text files, and NUL bytes
>> are not a valid textual character.  So, I think that this case is
>> outside of what 'substitute*' is meant to do, and therefore not a bug in
>> 'substitute*', although of course a more graceful error would surely be
>> preferable.
>
> Yes, that’s also a good point.
>
> So yeah, I think it may be good “eventually” to switch to SRFI-115, but
> that’s not urgent.

Sorry for taking some time to answer; I was puzzled by the fact that my
repro didn't work when ran from the REPL. It seems the problem only
occurs when run inside Guix's build environment, maybe a side effect
which depends on the locale used?

In the `patch-el-files' phase of the emacs-build-system, we find the
following snippet:

    (with-directory-excursion el-dir
      ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
      ;; with the "ISO-8859-1" locale.
      (unless (false-if-exception (substitute-cmd))
        (with-fluids ((%default-port-encoding "ISO-8859-1"))
          (substitute-cmd))))

In case an exception is returned while processing the file, it is
retried being opened with the "ISO-8859-1" encoding. Or, this resolves
to a call to `open-file', which documentation says:

‘b’
          Use binary mode, ensuring that each byte in the file will be
          read as one Scheme character.

          To provide this property, the file will be opened with the
          8-bit character encoding "ISO-8859-1", ignoring the default
          port encoding.  *Note Ports::, for more information on port
          encodings.

So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
we are doing the same as if we had passed the 'binary option. Could this
explain why we end up with NUL characters where we were expecting text?

To validate this hypothesis, I've added the following test message to
the patch-el-files phase:

      (unless (false-if-exception (substitute-cmd))
        (format (current-error-port) ">>> IS THIS IT? <<<")
        (with-fluids ((%default-port-encoding "ISO-8859-1"))
          (substitute-cmd))))

And re-ran the emacs-realgud build (minus the patch working around this
issue), and this is what I got:

--8<---------------cut here---------------start------------->8---
starting phase `patch-el-files'
>>> IS THIS IT? <<<Backtrace:
          15 (primitive-load "/gnu/store/xdmirz24yxpxzqh8xj83z9h0axm…")
In ice-9/eval.scm:
   191:35 14 (_ _)
In srfi/srfi-1.scm:
   863:16 13 (every1 #<procedure 9a9540 at /gnu/store/mz8vs1cxv1z7y…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/gnu-build-system.scm:
   684:27 12 (_ _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/emacs-build-system.scm:
   117:10 11 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
    640:9 10 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
    849:4  9 (with-throw-handler _ _ _)
In ice-9/ports.scm:
   444:17  8 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
   609:26  7 (_ _)
   635:26  6 (_ #<input: ./realgud/common/bp-image-data.el 17> #<inp…>)
In srfi/srfi-1.scm:
   466:18  5 (fold #<procedure 7ffff53d1520 at /gnu/store/mz8vs1cxv…> …)
In /gnu/store/mz8vs1cxv1z7yrc1awzgby61qnxd481p-module-import/guix/build/utils.scm:
   638:37  4 (_ _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\…")
In ice-9/regex.scm:
   189:12  3 (list-matches _ _ _)
   177:19  2 (fold-matches _ "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿ…" …)
In unknown file:
           1 (regexp-exec #<regexp 81e400> "\"II*\x00(\x03\x00\x00ÿ…" …)
In ice-9/boot-9.scm:
   760:25  0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: string contains #\nul character: "\"II*\x00(\x03\x00\x00ÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ@@@@ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿÿÿÿÿÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿBBBBÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x10\x10\x10\x10ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x04\x04\x04\x04ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x01\x01\x01\x01ÿÿÿÿ\x04\x04\x04\x04ÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþÿ<<<<ÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿÿÿ\x0f\x0f\x0f\x0fÿÿþÿ>>>>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x14\x00\x00\x01\x03\x00\x01\x00\x00\x00\n"
builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
@ build-failed /gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv - 1 builder for `/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv' failed with exit code 1
guix build: error: build failed: build of
`/gnu/store/ar2j6kxz99s3s5wjs2z7ykiw75m9vv72-emacs-realgud-1.4.4.drv'
failed
--8<---------------cut here---------------end--------------->8---

So it is indeed triggered by switching to the "ISO-8859-1" encoding
(although I still cannot reproduce this from the REPL?).

If I remove the exception guard like this:

--8<---------------cut here---------------start------------->8---
     (with-directory-excursion el-dir
       ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
       ;; with the "ISO-8859-1" locale.
-      (unless (false-if-exception (substitute-cmd))
-        (with-fluids ((%default-port-encoding "ISO-8859-1"))
-          (substitute-cmd))))
+      (substitute-cmd))
     #t))
--8<---------------cut here---------------end--------------->8---

the exception thrown on the first substitute* call is this:

--8<---------------cut here---------------start------------->8---
starting phase `patch-el-files'
Backtrace:
          12 (primitive-load "/gnu/store/dvyyqxfr08fsr18k2f43gakh23d…")
In ice-9/eval.scm:
   191:35 11 (_ _)
In srfi/srfi-1.scm:
   863:16 10 (every1 #<procedure 9a96a0 at /gnu/store/xn6p33hhfyz6l…> …)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/gnu-build-system.scm:
   684:27  9 (_ _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/emacs-build-system.scm:
   104:27  8 (patch-el-files #:outputs _)
In srfi/srfi-1.scm:
    640:9  7 (for-each #<procedure substitute-one-file (file-name)> _)
In ice-9/boot-9.scm:
    849:4  6 (with-throw-handler _ _ _)
In ice-9/ports.scm:
   444:17  5 (call-with-input-file _ _ #:binary _ #:encoding _ # _)
In /gnu/store/xn6p33hhfyz6l5j9jd9qpnblp9ajnb9k-module-import/guix/build/utils.scm:
   609:26  4 (_ _)
   645:22  3 (_ #<input: ./realgud/common/bp-image-data.el 15> #<inp…>)
In ice-9/rdelim.scm:
   195:24  2 (read-line _ _)
In unknown file:
           1 (%read-line #<input: ./realgud/common/bp-image-data.el …>)
In ice-9/boot-9.scm:
   760:25  0 (dispatch-exception _ _ _)

ice-9/boot-9.scm:760:25: In procedure dispatch-exception:
ice-9/boot-9.scm:760:25: Throw to key `decoding-error' with args
`("peek-char" "input decoding error" 84 #<input:
./realgud/common/bp-image-data.el 15>)'.
--8<---------------cut here---------------end--------------->8---

Should we keep my workaround for now? It seems there are valid cases to
have the file opened as "ISO-8859-1", but this can mean introducing
binary symbols such as NUL in the data (thus regexp crashes). When we
finally move to srfi-115, we should remove this workaround.

WDYT?

Here's an updated patch with Ludovic's suggestion:

[0001-utils-Prevent-substitute-from-crashing-on-files-cont.patch (text/x-patch, attachment)]
[Message part 3 (text/plain, inline)]
Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Tue, 23 Jan 2018 14:12:02 GMT) Full text and rfc822 format available.

Message #29 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Mark H Weaver <mhw <at> netris.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Tue, 23 Jan 2018 15:11:04 +0100
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> In the `patch-el-files' phase of the emacs-build-system, we find the
> following snippet:
>
>     (with-directory-excursion el-dir
>       ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
>       ;; with the "ISO-8859-1" locale.
>       (unless (false-if-exception (substitute-cmd))
>         (with-fluids ((%default-port-encoding "ISO-8859-1"))
>           (substitute-cmd))))
>
> In case an exception is returned while processing the file, it is
> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
> to a call to `open-file', which documentation says:
>
> ‘b’
>           Use binary mode, ensuring that each byte in the file will be
>           read as one Scheme character.
>
>           To provide this property, the file will be opened with the
>           8-bit character encoding "ISO-8859-1", ignoring the default
>           port encoding.  *Note Ports::, for more information on port
>           encodings.
>
> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
> we are doing the same as if we had passed the 'binary option. Could this
> explain why we end up with NUL characters where we were expecting text?

That could be the reason.  Guile provides a way to honor Emacs-style
‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
#:guess-encoding #t (info "(guile) Character Encoding of Source Files").

Did the faulty file have such a declaration?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 25 Jan 2018 05:12:02 GMT) Full text and rfc822 format available.

Message #32 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Mark H Weaver <mhw <at> netris.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Thu, 25 Jan 2018 00:11:26 -0500
ludo <at> gnu.org (Ludovic Courtès) writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> In the `patch-el-files' phase of the emacs-build-system, we find the
>> following snippet:
>>
>>     (with-directory-excursion el-dir
>>       ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
>>       ;; with the "ISO-8859-1" locale.
>>       (unless (false-if-exception (substitute-cmd))
>>         (with-fluids ((%default-port-encoding "ISO-8859-1"))
>>           (substitute-cmd))))
>>
>> In case an exception is returned while processing the file, it is
>> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
>> to a call to `open-file', which documentation says:
>>
>> ‘b’
>>           Use binary mode, ensuring that each byte in the file will be
>>           read as one Scheme character.
>>
>>           To provide this property, the file will be opened with the
>>           8-bit character encoding "ISO-8859-1", ignoring the default
>>           port encoding.  *Note Ports::, for more information on port
>>           encodings.
>>
>> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
>> we are doing the same as if we had passed the 'binary option. Could this
>> explain why we end up with NUL characters where we were expecting text?
>
> That could be the reason.  Guile provides a way to honor Emacs-style
> ‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
> #:guess-encoding #t (info "(guile) Character Encoding of Source Files").
>
> Did the faulty file have such a declaration?

Sadly, it doesn't. Although even if it did, I don't think it would be
very robust to expect every misbehaving files we might encounter to
include one!

So I think we should apply my v2 patch to core-updates for now (see my
previous reply on this thread), until we have our substitute routine
implemented using srfi-115!

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 25 Jan 2018 11:12:02 GMT) Full text and rfc822 format available.

Message #35 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Mark H Weaver <mhw <at> netris.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Thu, 25 Jan 2018 12:11:30 +0100
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> ludo <at> gnu.org (Ludovic Courtès) writes:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>>
>>> In the `patch-el-files' phase of the emacs-build-system, we find the
>>> following snippet:
>>>
>>>     (with-directory-excursion el-dir
>>>       ;; Some old '.el' files (e.g., tex-buf.el in AUCTeX) are still encoded
>>>       ;; with the "ISO-8859-1" locale.
>>>       (unless (false-if-exception (substitute-cmd))
>>>         (with-fluids ((%default-port-encoding "ISO-8859-1"))
>>>           (substitute-cmd))))
>>>
>>> In case an exception is returned while processing the file, it is
>>> retried being opened with the "ISO-8859-1" encoding. Or, this resolves
>>> to a call to `open-file', which documentation says:
>>>
>>> ‘b’
>>>           Use binary mode, ensuring that each byte in the file will be
>>>           read as one Scheme character.
>>>
>>>           To provide this property, the file will be opened with the
>>>           8-bit character encoding "ISO-8859-1", ignoring the default
>>>           port encoding.  *Note Ports::, for more information on port
>>>           encodings.
>>>
>>> So, by opening an file whose encoding is unknown as a ISO-8859-1 file,
>>> we are doing the same as if we had passed the 'binary option. Could this
>>> explain why we end up with NUL characters where we were expecting text?
>>
>> That could be the reason.  Guile provides a way to honor Emacs-style
>> ‘encoding’ declarations, and ‘call-with-input-file’ does that if we pass
>> #:guess-encoding #t (info "(guile) Character Encoding of Source Files").
>>
>> Did the faulty file have such a declaration?
>
> Sadly, it doesn't. Although even if it did, I don't think it would be
> very robust to expect every misbehaving files we might encounter to
> include one!

Sure, I was asking just because it’s an Emacs-related package.

> So I think we should apply my v2 patch to core-updates for now (see my
> previous reply on this thread), until we have our substitute routine
> implemented using srfi-115!

Sounds good!  Note that I’ll wait until after the current ‘core-updates’
has been merged.  Please do ping me if you think I’ve forgotten!

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 14 Jun 2018 01:41:01 GMT) Full text and rfc822 format available.

Message #38 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Wed, 13 Jun 2018 21:40:49 -0400
ludo <at> gnu.org (Ludovic Courtès) writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>>  NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>
> [...]
>
>> +           ((string-contains line (make-string 1 #\nul))
>
> Rather (string-index line #\nul).
>
>> +            ;; The regexp functions of the GNU C library (which Guile uses)
>> +            ;; cannot deal with NUL characters, so skip to the next line.
>> +            (format #t "skipping line with NUL characters: ~s\n" line)
>> +            (loop (read-line in 'concat)))
>
> Rather (format (current-error-port) …).
>
> It’s strange semantics, but it’s probably better than crashing in the
> contexts where we use it.
>
> Otherwise LGTM.  This would have to go to the next ‘core-updates’ (or
> ‘core-updates-next’ in the meantime.)
>
> Thanks!
>
> Ludo’.

Ping. Is it the right time to merge this?

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 14 Jun 2018 07:05:01 GMT) Full text and rfc822 format available.

Message #41 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Thu, 14 Jun 2018 03:02:47 -0400
Hi Maxim,

Thanks for working on this.  I found a problem with this patch,
and I also have a suggestion.  Please see below.

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
> Date: Sun, 14 Jan 2018 20:31:33 -0500
> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>  NUL chars.
>
> Fixes issue #30116.
>
> * guix/build/utils.scm (substitute): Add condition to skip lines containing
> the NUL character.
> ---
>  guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
> index 7391307c8..975f4e70a 100644
> --- a/guix/build/utils.scm
> +++ b/guix/build/utils.scm
> @@ -3,6 +3,7 @@
>  ;;; Copyright © 2013 Andreas Enge <andreas <at> enge.fr>
>  ;;; Copyright © 2013 Nikita Karetnikov <nikita <at> karetnikov.org>
>  ;;; Copyright © 2015 Mark H Weaver <mhw <at> netris.org>
> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
>  a substitution of the original line.  Be careful about using '$' to match the
>  end of a line; by itself it won't match the terminating newline of a line."
>    (let ((rx+proc  (map (match-lambda
> -                        (((? regexp? pattern) . proc)
> -                         (cons pattern proc))
> -                        ((pattern . proc)
> -                         (cons (make-regexp pattern regexp/extended)
> -                               proc)))
> +                         (((? regexp? pattern) . proc)
> +                          (cons pattern proc))
> +                         ((pattern . proc)
> +                          (cons (make-regexp pattern regexp/extended)
> +                                proc)))
>                         pattern+procs)))
>      (with-atomic-file-replacement file
>        (lambda (in out)
>          (let loop ((line (read-line in 'concat)))
> -          (if (eof-object? line)
> -              #t
> -              (let ((line (fold (lambda (r+p line)
> -                                  (match r+p
> -                                    ((regexp . proc)
> -                                     (match (list-matches regexp line)
> -                                       ((and m+ (_ _ ...))
> -                                        (proc line m+))
> -                                       (_ line)))))
> -                                line
> -                                rx+proc)))
> -                (display line out)
> -                (loop (read-line in 'concat)))))))))
> +          (cond
> +           ((eof-object? line)
> +            #t)
> +           ((string-contains line (make-string 1 #\nul))
> +            ;; The regexp functions of the GNU C library (which Guile uses)
> +            ;; cannot deal with NUL characters, so skip to the next line.
> +            (format #t "skipping line with NUL characters: ~s\n" line)
> +            (loop (read-line in 'concat)))

This code will unconditionally *delete* all lines that contain NULs.

This will happen because the lines with NULs are not being written to
the output file, which will replace the original file when this loop
reaches EOF.  So, any lines that are not copied to the output will be
lost.

To preserve the lines with NULs, you should call (display line out)
before calling 'loop'.

Also, please use (string-index line #\nul) to check for NULs instead of
'string-contains'.  It should be more efficient.

> +           (else
> +            (let ((line (fold (lambda (r+p line)
> +                                (match r+p
> +                                  ((regexp . proc)
> +                                   (match (list-matches regexp line)
> +                                     ((and m+ (_ _ ...))
> +                                      (proc line m+))
> +                                     (_ line)))))
> +                              line
> +                              rx+proc)))
> +              (display line out)
> +              (loop (read-line in 'concat))))))))))

With the changes suggested above, I would have no objection to pushing
this to core-updates.  However, it occurs to me that we could handle the
NUL case in a better way:

Since the C regex functions that we use cannot handle NUL bytes, we
could use a different code point to represent NUL during those
operations.  We could choose a code point from one of the Unicode
Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
does not occur in the string.

Let NUL* be the code point which will represent NUL bytes.  First
replace all NULs with NUL*s, then perform the substitutions, and finally
replace all ALT*s with NULs before writing to the output.

As an important optimization, we should avoid performing these extra
operations unless (string-index line #\nul) finds a NUL.

We could then perform these extra substitutions with simple, inefficient
code.  Maybe something like this (untested):

--8<---------------cut here---------------start------------->8---
    (with-atomic-file-replacement file
      (lambda (in out)
        (let loop ((line (read-line in 'concat)))
          (if (eof-object? line)
              #t
              (let* ((nul* (or (and (string-index line #\nul)
                                    (unused-private-use-code-point line))
                               #\nul))
                     (line* (replace-char #\nul nul* line))
                     (line1* (fold (lambda (r+p line)
                                     (match r+p
                                       ((regexp . proc)
                                        (match (list-matches regexp line)
                                          ((and m+ (_ _ ...))
                                           (proc line m+))
                                          (_ line)))))
                                   line*
                                   rx+proc))
                     (line1 (replace-char nul* #\nul line1*)))
                (display line1 out)
                (loop (read-line in 'concat)))))))))
--8<---------------cut here---------------end--------------->8---

Where the following additional private procedures would be added to
(guix build utils) above the definition for 'substitute':

--8<---------------cut here---------------start------------->8---
(define (unused-private-use-code-point s)
  "Find a code point within a Unicode Private Use Area that is not
present in S, and return the corresponding character object.  If one
cannot be found, return false."
  (define (scan lo hi)
    (and (<= lo hi)
         (let ((c (integer->char lo)))
           (if (string-index s c)
               (scan (+ lo 1) hi)
               c))))
  (or (scan   #xE000   #xF8FF)
      (scan  #xF0000  #xFFFFD)
      (scan #x100000 #x10FFFD)))

(define (replace-char c1 c2 s)
  "Return a string which is equal to S except with all instances of C1
replaced by C2.  If C1 and C2 are equal, return S."
  (if (char=? c1 c2)
      s
      (string-map (lambda (c)
                    (if (char=? c c1)
                        c2
                        c))
                  s)))
--8<---------------cut here---------------end--------------->8---

What do you think?

      Mark




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 14 Jun 2018 08:02:01 GMT) Full text and rfc822 format available.

Message #44 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Thu, 14 Jun 2018 10:01:09 +0200
Hello Maxim,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> ludo <at> gnu.org (Ludovic Courtès) writes:

[...]

>> Otherwise LGTM.  This would have to go to the next ‘core-updates’ (or
>> ‘core-updates-next’ in the meantime.)
>>
>> Thanks!
>>
>> Ludo’.
>
> Ping. Is it the right time to merge this?

Yes you can push it to ‘core-updates’ now.  Thank you!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Thu, 14 Jun 2018 08:03:02 GMT) Full text and rfc822 format available.

Message #47 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: 30116 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Thu, 14 Jun 2018 10:02:23 +0200
Mark H Weaver <mhw <at> netris.org> skribis:

> Thanks for working on this.  I found a problem with this patch,
> and I also have a suggestion.  Please see below.

I hadn’t seen Mark’s reply, which raises valid concerns.  Please dismiss
the message I just sent, Maxim.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Sat, 16 Jun 2018 16:48:02 GMT) Full text and rfc822 format available.

Message #50 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Sat, 16 Jun 2018 12:47:01 -0400
Hi Mark,

Mark H Weaver <mhw <at> netris.org> writes:

> Hi Maxim,
>
> Thanks for working on this.  I found a problem with this patch,
> and I also have a suggestion.  Please see below.
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
>> Date: Sun, 14 Jan 2018 20:31:33 -0500
>> Subject: [PATCH] utils: Prevent substitute from crashing on files containing
>>  NUL chars.
>>
>> Fixes issue #30116.
>>
>> * guix/build/utils.scm (substitute): Add condition to skip lines containing
>> the NUL character.
>> ---
>>  guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------
>>  1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/guix/build/utils.scm b/guix/build/utils.scm
>> index 7391307c8..975f4e70a 100644
>> --- a/guix/build/utils.scm
>> +++ b/guix/build/utils.scm
>> @@ -3,6 +3,7 @@
>>  ;;; Copyright © 2013 Andreas Enge <andreas <at> enge.fr>
>>  ;;; Copyright © 2013 Nikita Karetnikov <nikita <at> karetnikov.org>
>>  ;;; Copyright © 2015 Mark H Weaver <mhw <at> netris.org>
>> +;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
>>  ;;;
>>  ;;; This file is part of GNU Guix.
>>  ;;;
>> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line that will be written as
>>  a substitution of the original line.  Be careful about using '$' to match the
>>  end of a line; by itself it won't match the terminating newline of a line."
>>    (let ((rx+proc  (map (match-lambda
>> -                        (((? regexp? pattern) . proc)
>> -                         (cons pattern proc))
>> -                        ((pattern . proc)
>> -                         (cons (make-regexp pattern regexp/extended)
>> -                               proc)))
>> +                         (((? regexp? pattern) . proc)
>> +                          (cons pattern proc))
>> +                         ((pattern . proc)
>> +                          (cons (make-regexp pattern regexp/extended)
>> +                                proc)))
>>                         pattern+procs)))
>>      (with-atomic-file-replacement file
>>        (lambda (in out)
>>          (let loop ((line (read-line in 'concat)))
>> -          (if (eof-object? line)
>> -              #t
>> -              (let ((line (fold (lambda (r+p line)
>> -                                  (match r+p
>> -                                    ((regexp . proc)
>> -                                     (match (list-matches regexp line)
>> -                                       ((and m+ (_ _ ...))
>> -                                        (proc line m+))
>> -                                       (_ line)))))
>> -                                line
>> -                                rx+proc)))
>> -                (display line out)
>> -                (loop (read-line in 'concat)))))))))
>> +          (cond
>> +           ((eof-object? line)
>> +            #t)
>> +           ((string-contains line (make-string 1 #\nul))
>> +            ;; The regexp functions of the GNU C library (which Guile uses)
>> +            ;; cannot deal with NUL characters, so skip to the next line.
>> +            (format #t "skipping line with NUL characters: ~s\n" line)
>> +            (loop (read-line in 'concat)))
>
> This code will unconditionally *delete* all lines that contain NULs.
>
> This will happen because the lines with NULs are not being written to
> the output file, which will replace the original file when this loop
> reaches EOF.  So, any lines that are not copied to the output will be
> lost.
>
> To preserve the lines with NULs, you should call (display line out)
> before calling 'loop'.

Good observation! I agree that we should keep limit the effect of
ignoring NULs only to the substitution.

> Also, please use (string-index line #\nul) to check for NULs instead of
> 'string-contains'.  It should be more efficient.

OK!

>
>> +           (else
>> +            (let ((line (fold (lambda (r+p line)
>> +                                (match r+p
>> +                                  ((regexp . proc)
>> +                                   (match (list-matches regexp line)
>> +                                     ((and m+ (_ _ ...))
>> +                                      (proc line m+))
>> +                                     (_ line)))))
>> +                              line
>> +                              rx+proc)))
>> +              (display line out)
>> +              (loop (read-line in 'concat))))))))))
>
> With the changes suggested above, I would have no objection to pushing
> this to core-updates.  However, it occurs to me that we could handle the
> NUL case in a better way:
>
> Since the C regex functions that we use cannot handle NUL bytes, we
> could use a different code point to represent NUL during those
> operations.  We could choose a code point from one of the Unicode
> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
> does not occur in the string.
>
> Let NUL* be the code point which will represent NUL bytes.  First
> replace all NULs with NUL*s, then perform the substitutions, and finally
> replace all ALT*s with NULs before writing to the output.

Do I understand this transformation as NULs -> NUL*s and back from NUL*s
-> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

> As an important optimization, we should avoid performing these extra
> operations unless (string-index line #\nul) finds a NUL.

OK.

> We could then perform these extra substitutions with simple, inefficient
> code.  Maybe something like this (untested):
>
>     (with-atomic-file-replacement file
>       (lambda (in out)
>         (let loop ((line (read-line in 'concat)))
>           (if (eof-object? line)
>               #t
>               (let* ((nul* (or (and (string-index line #\nul)
>                                     (unused-private-use-code-point line))
>                                #\nul))
>                      (line* (replace-char #\nul nul* line))
>                      (line1* (fold (lambda (r+p line)
>                                      (match r+p
>                                        ((regexp . proc)
>                                         (match (list-matches regexp line)
>                                           ((and m+ (_ _ ...))
>                                            (proc line m+))
>                                           (_ line)))))
>                                    line*
>                                    rx+proc))
>                      (line1 (replace-char nul* #\nul line1*)))
>                 (display line1 out)
>                 (loop (read-line in 'concat)))))))))
>
>
> Where the following additional private procedures would be added to
> (guix build utils) above the definition for 'substitute':
>
> (define (unused-private-use-code-point s)
>   "Find a code point within a Unicode Private Use Area that is not
> present in S, and return the corresponding character object.  If one
> cannot be found, return false."
>   (define (scan lo hi)
>     (and (<= lo hi)
>          (let ((c (integer->char lo)))
>            (if (string-index s c)
>                (scan (+ lo 1) hi)
>                c))))
>   (or (scan   #xE000   #xF8FF)
>       (scan  #xF0000  #xFFFFD)
>       (scan #x100000 #x10FFFD)))
>
> (define (replace-char c1 c2 s)
>   "Return a string which is equal to S except with all instances of C1
> replaced by C2.  If C1 and C2 are equal, return S."
>   (if (char=? c1 c2)
>       s
>       (string-map (lambda (c)
>                     (if (char=? c c1)
>                         c2
>                         c))
>                   s)))
>
> What do you think?

It raises the complexity level a bit for something which doesn't seem to
be a very common scenario, but otherwise seems a very elegant
workaround. It seems to me that your implementation is already pretty
complete. I'll try write a test for validating it and report back.

Thank you for sharing your ideas!

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Sun, 17 Jun 2018 04:38:02 GMT) Full text and rfc822 format available.

Message #53 received at 30116 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 30116 <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates))
Date: Sun, 17 Jun 2018 00:36:00 -0400
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Mark H Weaver <mhw <at> netris.org> writes:
>
>> With the changes suggested above, I would have no objection to pushing
>> this to core-updates.  However, it occurs to me that we could handle the
>> NUL case in a better way:
>>
>> Since the C regex functions that we use cannot handle NUL bytes, we
>> could use a different code point to represent NUL during those
>> operations.  We could choose a code point from one of the Unicode
>> Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that
>> does not occur in the string.
>>
>> Let NUL* be the code point which will represent NUL bytes.  First
>> replace all NULs with NUL*s, then perform the substitutions, and finally
>> replace all ALT*s with NULs before writing to the output.
>
> Do I understand this transformation as NULs -> NUL*s and back from NUL*s
> -> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation.

Sorry, it's a typo.  Where I wrote "ALT*s", I meant to write "NUL*s".

>> What do you think?
>
> It raises the complexity level a bit for something which doesn't seem to
> be a very common scenario,

FWIW, I agree that it's not a common scenario, and it's not entirely
clear that it was worth the time I spent on it, or the added complexity.
On the other hand, I would dislike having a basic API like 'substitute*'
be subtly broken in this way.

> but otherwise seems a very elegant
> workaround. It seems to me that your implementation is already pretty
> complete. I'll try write a test for validating it and report back.

Sounds good.  Thank you!

      Mark




Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
You have taken responsibility. (Fri, 08 Jan 2021 19:16:02 GMT) Full text and rfc822 format available.

Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
bug acknowledged by developer. (Fri, 08 Jan 2021 19:16:02 GMT) Full text and rfc822 format available.

Message #58 received at 30116-done <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 30116-done <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Fri, 08 Jan 2021 14:14:56 -0500
Hello Mark,

I was recently reminded of this bug by a new encounter; at last wrote a
test for your proposed fix, and it appear to work as intended!  I've
committed it on your behalf in commit 485ac28235 on the core-updates
branch.

Closing!

Thank you for the clever hack :-)

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#30116; Package guix. (Fri, 08 Jan 2021 21:44:02 GMT) Full text and rfc822 format available.

Message #61 received at 30116-done <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 30116-done <at> debbugs.gnu.org
Subject: Re: bug#30116: [PATCH] `substitute' crashes when file contains NUL
 characters (core-updates)
Date: Fri, 08 Jan 2021 16:42:43 -0500
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> I was recently reminded of this bug by a new encounter; at last wrote a
> test for your proposed fix, and it appear to work as intended!  I've
> committed it on your behalf in commit 485ac28235 on the core-updates
> branch.

Thanks for taking care of this Maxim, and for adding the test case.

    Regards,
      Mark




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 06 Feb 2021 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 51 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.