GNU bug report logs - #16286
24.3.50; insert-file-contents may bring invisible garbage

Previous Next

Package: emacs;

Reported by: Andrey Kotlarski <m00naticus <at> gmail.com>

Date: Sun, 29 Dec 2013 14:06:02 UTC

Severity: important

Found in version 24.3.50

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16286 in the body.
You can then email your comments to 16286 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Sun, 29 Dec 2013 14:06:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrey Kotlarski <m00naticus <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 29 Dec 2013 14:06:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrey Kotlarski <m00naticus <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3.50; insert-file-contents may bring invisible garbage
Date: Sun, 29 Dec 2013 16:05:22 +0200
In trunk inserting few bytes from file may sometimes result in nothing
visible in the buffer while invisible artifacts are present and may
affect subsequent operations.  Moreover, there doesn't seem to be way to
recover from this.  Here's example session with emacs -Q:

(let ((file "test.txt"))
  (unless (file-exists-p file)
    (find-file file)
    (insert "абв")                      ;Cyrillic letters
    (save-buffer)
    (kill-buffer))

  (let ((buf (generate-new-buffer "test")))
    (switch-to-buffer buf)
    (insert-file-contents file nil 0 2) ;inserts а
    (goto-char (point-max))
    (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer
                                        ;but actually there is
    (erase-buffer)                      ;and still is
    (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1
    (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty
    ))

Trying to insert multibyte characters now brings content length issues,
garbage inserted and at some point Emacs crashes.

In release 24.3 and earlier insert-file-contents seems to always insert
something, be it wrongly decoded or raw eight-bit characters.  But it is
visible and easy to deal with.  The above example works fine there.
This is useful for the vlf package (https://github.com/m00natic/vlfi) as
a way to detect insufficient amount of bytes requested and allows
further adjustment.


In GNU Emacs 24.3.50.1 (x86_64-pc-linux-gnu)
 of 2013-12-29 on andrexhe
Bzr revision: 115803 eggert <at> cs.ucla.edu-20131229075253-hmeofd1oihd5n3rk
Windowing system distributor `The X.Org Foundation', version 11.0.11405000
Configured using:
 `configure --build=x86_64-pc-linux-gnu --enable-link-time-optimization
 --with-x-toolkit=no --with-wide-int --without-toolkit-scroll-bars
 --without-xaw3d --without-gpm --without-gconf --without-gsettings
 build_alias=x86_64-pc-linux-gnu 'CFLAGS=-march=native -mtune=native -O2
 -pipe''

Important settings:
  value of $LC_COLLATE: C
  value of $LANG: bg_BG.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Fundamental

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e p o - e m - b u <tab> <return>

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
let: Wrong type argument: inserted-chars, 1

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
easymenu mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util help-fns mail-prsvr mail-utils time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type mwheel x-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list newcomment lisp-mode
prog-mode register page menu-bar rfn-eshadow timer select scroll-bar
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify dynamic-setting font-render-setting x multi-tty
emacs)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Thu, 02 Jan 2014 16:31:01 GMT) Full text and rfc822 format available.

Message #8 received at 16286 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrey Kotlarski <m00naticus <at> gmail.com>, Kenichi Handa <handa <at> gnu.org>
Cc: 16286 <at> debbugs.gnu.org
Subject: Re: bug#16286: 24.3.50;
 insert-file-contents may bring invisible garbage
Date: Thu, 02 Jan 2014 18:30:30 +0200
> From: Andrey Kotlarski <m00naticus <at> gmail.com>
> Date: Sun, 29 Dec 2013 16:05:22 +0200
> 
> In trunk inserting few bytes from file may sometimes result in nothing
> visible in the buffer while invisible artifacts are present and may
> affect subsequent operations.  Moreover, there doesn't seem to be way to
> recover from this.  Here's example session with emacs -Q:
> 
> (let ((file "test.txt"))
>   (unless (file-exists-p file)
>     (find-file file)
>     (insert "абв")                      ;Cyrillic letters
>     (save-buffer)
>     (kill-buffer))
> 
>   (let ((buf (generate-new-buffer "test")))
>     (switch-to-buffer buf)
>     (insert-file-contents file nil 0 2) ;inserts а
>     (goto-char (point-max))
>     (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer
>                                         ;but actually there is
>     (erase-buffer)                      ;and still is
>     (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1
>     (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty
>     ))
> 
> Trying to insert multibyte characters now brings content length issues,
> garbage inserted and at some point Emacs crashes.

Your Emacs is built without --enable-checking; if that configure-time
switch is used, Emacs hits an assertion violation as soon as this sexp
is evaluated:

     (insert-file-contents file nil 2 3)

Also, you are wrong about there being some invisible stuff in the
buffer.  The problem is elsewhere: Emacs gets confused about the
number of characters and the number of bytes in the buffer.  These two
counts should be in sync at all times; once they become
unsynchronized, Emacs will generally crash very soon.

I'm CC'ing Handa-san in the hope that he will be able to suggest a
solution.

The problem happens in decode_coding_gap (called from
insert-file-contents), in this code fragment (note the call to
detect_coding):

  if (CODING_REQUIRE_DETECTION (coding))
    detect_coding (coding);
  attrs = CODING_ID_ATTRS (coding->id);
  if (! disable_ascii_optimization
      && ! coding->src_multibyte
      && ! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
      && NILP (CODING_ATTR_POST_READ (attrs))
      && NILP (get_translation_table (attrs, 0, NULL)))
    {
      chars = coding->head_ascii;
      if (chars < 0)
	chars = check_ascii (coding);
      if (chars != bytes)
	{
	  /* There exists a non-ASCII byte.  */
	  if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8))
	    {
	      if (coding->detected_utf8_chars >= 0)
		chars = coding->detected_utf8_chars;  <<<<<<<<<<<<<<
	      else
		chars = check_utf_8 (coding);

This reuses the number of characters that are valid UTF-8 sequences in
the byte stream to be decoded, stored in coding->detected_utf8_chars,
which were found by detect_coding_utf_8, which was called by
detect_coding.  In the case in point, detect_coding_utf_8 finds zero
valid UTF-8 sequences, and so 'chars' becomes zero.  But the number of
decoded bytes is not adjusted to fit that, so it stays at its original
value of 1.  Then, decode_coding_gap does this:

	  coding->produced = bytes;
	  coding->produced_char = chars;
	  insert_from_gap (chars, bytes, 1);

Since 'chars' is zero, but 'bytes' is 1, this causes a mismatch
between buffer's Z and Z_BYTE values, and from there it's a slippery
slope all the way to an assertion violation during redisplay.

Similar problems happen when insert-file-contents is called to read
some number of bytes that doesn't end at a UTF-8 sequence boundary.

I think I see a potential reason for this in detect_coding_utf_8, near
its end:

      if (nchars < src_end - coding->source)
	/* The found characters are less than source bytes, which
	   means that we found a valid non-ASCII characters.  */
	detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG;

This misses the use case such as this one, where the detection loop
consumed one byte, found it not to be the head byte of a UTF-8
sequence, and then hit the end of the source bytes.  It looks like the
function incorrectly returns a success indication in this case, which
might be part of the problem.

> In release 24.3 and earlier insert-file-contents seems to always insert
> something, be it wrongly decoded or raw eight-bit characters.  But it is
> visible and easy to deal with.  The above example works fine there.
> This is useful for the vlf package (https://github.com/m00natic/vlfi) as
> a way to detect insufficient amount of bytes requested and allows
> further adjustment.

What vlf does is strange and IMO not the best possible solution to
this issue:

        (cond ((vlf-partial-decode-shown-p) ;remove raw bytes from end
               (goto-char (point-max))
               (while (eq (char-charset (preceding-char)) 'eight-bit)
                 (setq shift-end (1- shift-end))
                 (delete-char -1)))
              ((< end vlf-file-size) ;add bytes until new character is displayed
               (let ((position (or position (point-min)))
                     (expected-size (buffer-size)))
                 (while (and (progn
                               (setq shift-end (1+ shift-end)
                                     end (1+ end))
                               (delete-region position (point-max))
                               (goto-char position)
                               (insert-file-contents buffer-file-name
                                                     nil start end)
                               (< end vlf-file-size))
                             (= expected-size (buffer-size))))))))

This seems to have a subtle misfeature of not supporting files with
inconsistent encoding, or files with binary data, because there _all_
characters will belong to the eight-bit charset.  Also, I don't
understand why the removal of raw bytes is conditioned on Emacs
version: why not just remove them unconditionally: if there are none,
nothing will be removed.

More to the point, I'm not sure whether inserting raw bytes in
insert-file-contents when a portion of a multibyte sequence was read
(i.e. go back to what Emacs 24.3 did) will be good for vlf.  It sounds
to me much better if Emacs would only return complete characters read
from the file, so that applications will not need to remove those
stray bytes.

Finally, it would seem a better design for vlf to always read a few
more bytes than was requested into some scratch buffer, and then
decode them manually to determine just how many to copy to the main
buffer.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Sat, 04 Jan 2014 22:43:02 GMT) Full text and rfc822 format available.

Message #11 received at 16286 <at> debbugs.gnu.org (full text, mbox):

From: Andrey Kotlarski <m00naticus <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Kenichi Handa <handa <at> gnu.org>, 16286 <at> debbugs.gnu.org
Subject: Re: bug#16286: 24.3.50;
 insert-file-contents may bring invisible garbage
Date: Sun, 05 Jan 2014 00:42:39 +0200
Thanks a lot for the hints and pointers.

[  2 януари 2014, 18:30 +0200, четвъртък ] Eli Zaretskii:

> What vlf does is strange and IMO not the best possible solution to
> this issue:
> ...
> This seems to have a subtle misfeature of not supporting files with
> inconsistent encoding, or files with binary data, because there _all_
> characters will belong to the eight-bit charset.

There had been changes meanwhile which hopefully address these (no
special treatment of eight-bit) along other issues (vlf-base.el).

> More to the point, I'm not sure whether inserting raw bytes in
> insert-file-contents when a portion of a multibyte sequence was read
> (i.e. go back to what Emacs 24.3 did) will be good for vlf.  It sounds
> to me much better if Emacs would only return complete characters read
> from the file, so that applications will not need to remove those
> stray bytes.

I agree.  It would be ideal for vlf if insert-file-contents would also
report the number of stray bytes at the end that haven't been utilized.

> Finally, it would seem a better design for vlf to always read a few
> more bytes than was requested into some scratch buffer, and then
> decode them manually to determine just how many to copy to the main
> buffer.

I see that vlf somehow works only by some accident with current trunk
(and --enable-checking disabled), so I'm on it.  My initial attempt at
naively combining insert-file-contents-literally with
decode-coding-inserted-region though often produces wrong decoding where
insert-file-contents would be good.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 26 Jan 2014 00:37:02 GMT) Full text and rfc822 format available.

Notification sent to Andrey Kotlarski <m00naticus <at> gmail.com>:
bug acknowledged by developer. (Sun, 26 Jan 2014 00:37:03 GMT) Full text and rfc822 format available.

Message #16 received at 16286-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16286-done <at> debbugs.gnu.org
Cc: Kenichi Handa <handa <at> gnu.org>
Subject: Re: 24.3.50; insert-file-contents may bring invisible garbage
Date: Sat, 25 Jan 2014 16:36:29 -0800
I installed a patch as trunk bzr 116158, which (at least for me) fixes 
the reported bug, and am taking the liberty of marking this as done. 
There may well be a better fix, but at least Emacs shouldn't crash or 
report nonsense now.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Mon, 27 Jan 2014 15:02:01 GMT) Full text and rfc822 format available.

Message #19 received at 16286-done <at> debbugs.gnu.org (full text, mbox):

From: handa <at> gnu.org (K. Handa)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16286-done <at> debbugs.gnu.org
Subject: Re: 24.3.50; insert-file-contents may bring invisible garbage
Date: Tue, 28 Jan 2014 00:01:00 +0900
In article <52E4588D.70004 <at> cs.ucla.edu>, Paul Eggert <eggert <at> cs.ucla.edu> writes:

> I installed a patch as trunk bzr 116158, which (at least for me) fixes 
> the reported bug, and am taking the liberty of marking this as done. 
> There may well be a better fix, but at least Emacs shouldn't crash or 
> report nonsense now.

Thank you for working on this bug which I introduced when I
made decode_coding_gap optimized for ASCII and UTF-8 only
files.  

Your change is to set CODING_MODE_LAST_BLOCK in coding->mode
before calling decode_coding_gap so that detect_coding
doesn't detect a file as utf-8 if it has incomplete utf-8
sequence at the tail (as the reported testcase).

But, I think it is better that detect_coding detects such a
file as utf-8 and treats the trailing garbage as raw bytes.
24.3 does it, and that is why decode_coding_gap sets
CODING_MODE_LAST_BLOCK after calling detect_coding.

So, I suggest the attached fix instead of yours.  What do
you think?

---
Kenichi Handa
handa <at> gnu.org

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2014-01-26 12:17:55 +0000
+++ src/ChangeLog	2014-01-27 14:53:58 +0000
@@ -1,3 +1,16 @@
+2014-01-27  K. Handa  <handa <at> gnu.org>
+
+	These change are to fix bug#16286 in the different way than what
+	done by revno:116158.
+
+	* coding.h (struct coding_system): New member detected_utf8_bytes.
+
+	* coding.c (detect_coding_utf_8): Set coding->detected_utf8_bytes.
+	(decode_coding_gap): Use short cut for UTF-8 file reading only
+	when coding->detected_utf8_bytes equals to coding->src_bytes.
+
+	* fileio.c (Finsert_file_contents): Cancel the previous change.
+
 2014-01-26  Jan Djärv  <jan.h.d <at> swipnet.se>
 
 	* xterm.c (x_focus_changed): Check for non-X terminal-frame (Bug#16540)

=== modified file 'src/coding.c'
--- src/coding.c	2014-01-26 01:20:24 +0000
+++ src/coding.c	2014-01-27 14:47:43 +0000
@@ -1300,6 +1300,7 @@
 	   means that we found a valid non-ASCII characters.  */
 	detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG;
     }
+  coding->detected_utf8_bytes = src_base - coding->source;
   coding->detected_utf8_chars = nchars;
   return 1;
 }
@@ -7890,7 +7891,7 @@
   coding->dst_multibyte = ! NILP (BVAR (current_buffer, enable_multibyte_characters));
 
   coding->head_ascii = -1;
-  coding->detected_utf8_chars = -1;
+  coding->detected_utf8_bytes = coding->detected_utf8_chars = -1;
   coding->eol_seen = EOL_SEEN_NONE;
   if (CODING_REQUIRE_DETECTION (coding))
     detect_coding (coding);
@@ -7907,7 +7908,8 @@
       if (chars != bytes)
 	{
 	  /* There exists a non-ASCII byte.  */
-	  if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8))
+	  if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8)
+	      && coding->detected_utf8_bytes == coding->src_bytes)
 	    {
 	      if (coding->detected_utf8_chars >= 0)
 		chars = coding->detected_utf8_chars;

=== modified file 'src/coding.h'
--- src/coding.h	2014-01-26 01:20:24 +0000
+++ src/coding.h	2014-01-27 14:47:43 +0000
@@ -468,7 +468,9 @@
      the eol format.  */
   ptrdiff_t head_ascii;
 
-  ptrdiff_t detected_utf8_chars;
+  /* How many bytes/chars at the source are detected as valid utf-8
+     sequence.  Set by detect_coding_utf_8.  */
+  ptrdiff_t detected_utf8_bytes, detected_utf8_chars;
 
   /* Used internally in coding.c.  See the comment of detect_ascii.  */
   int eol_seen;

=== modified file 'src/fileio.c'
--- src/fileio.c	2014-01-26 00:32:30 +0000
+++ src/fileio.c	2014-01-27 14:47:59 +0000
@@ -4298,7 +4298,6 @@
       Z_BYTE -= inserted;
       ZV -= inserted;
       Z -= inserted;
-      coding.mode |= CODING_MODE_LAST_BLOCK;
       decode_coding_gap (&coding, inserted, inserted);
       inserted = coding.produced_char;
       coding_system = CODING_ID_NAME (coding.id);





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Mon, 27 Jan 2014 17:02:02 GMT) Full text and rfc822 format available.

Message #22 received at 16286-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "K. Handa" <handa <at> gnu.org>
Cc: 16286-done <at> debbugs.gnu.org
Subject: Re: 24.3.50; insert-file-contents may bring invisible garbage
Date: Mon, 27 Jan 2014 09:01:18 -0800
Yes, thanks, that looks like a better patch.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16286; Package emacs. (Wed, 29 Jan 2014 13:42:02 GMT) Full text and rfc822 format available.

Message #25 received at 16286-done <at> debbugs.gnu.org (full text, mbox):

From: handa <at> gnu.org (K. Handa)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16286-done <at> debbugs.gnu.org
Subject: Re: 24.3.50; insert-file-contents may bring invisible garbage
Date: Wed, 29 Jan 2014 22:40:51 +0900
In article <52E690DE.6070101 <at> cs.ucla.edu>, Paul Eggert <eggert <at> cs.ucla.edu> writes:

> Yes, thanks, that looks like a better patch.

Ok, I've just committed it.

---
Kenichi Handa
handa <at> gnu.org




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 27 Feb 2014 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 70 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.