GNU bug report logs - #37580
26.3; setting buffer as unibyte temporarily may change buffer contents


Package: emacs;

Reported by: ynyaaa <at> gmail.com

Date: Wed, 2 Oct 2019 09:44:01 UTC

Severity: normal

Tags: notabug

Found in version 26.3

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37580 in the body.
You can then email your comments to 37580 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#37580; Package emacs. (Wed, 02 Oct 2019 09:44:01 GMT)

Acknowledgement sent to ynyaaa <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 02 Oct 2019 09:44:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: bug-gnu-emacs <at> gnu.org
Subject: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date: Wed, 02 Oct 2019 18:43:45 +0900
If a multibyte buffer contains sequences of eight-bit characters,
evaluating the form
 (progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
may convert them to multibyte characters.

Afterwards, the positions recorded in buffer-undo-list may no longer
match the buffer contents.  In the form below, undo reinserts the
character '1' at the wrong position.

(with-temp-buffer
  (insert "一123")
  (encode-coding-region 1 2 'utf-8)
  (buffer-enable-undo)
  (undo-boundary)
  (progn (goto-char (point-min))
         (search-forward "1")
         (delete-char -1))
  (undo-boundary)
  (progn (set-buffer-multibyte nil)
         (set-buffer-multibyte t))
  (undo)
  (buffer-string))
=> "一231"


In GNU Emacs 26.3 (build 1, x86_64-w64-mingw32)
 of 2019-08-29 built on CIRROCUMULUS
Repository revision: 96dd0196c28bc36779584e47fffcca433c9309cd
Windowing system distributor 'Microsoft Corp.', version 6.3.9600
Recent messages:

Configured using:
 'configure --without-dbus --host=x86_64-w64-mingw32
 --without-compress-install 'CFLAGS=-O2 -static -g3''

Configured features:
XPM JPEG TIFF GIF PNG RSVG SOUND NOTIFY ACL GNUTLS LIBXML2 ZLIB
TOOLKIT_SCROLL_BARS THREADS LCMS2

Important settings:
  value of $LANG: JPN
  locale-coding-system: cp932

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t

Load-path shadows:
None found.

Features:
(network-stream nsm starttls tls gnutls mailalias smtpmail auth-source
cl-seq eieio eieio-core cl-macs eieio-loaddefs misearch multi-isearch pp
shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv bytecomp
byte-compile cconv dired dired-loaddefs format-spec rfc822 mml mml-sec
password-cache epa derived epg epg-config gnus-util rmail rmail-loaddefs
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr
mail-utils cl-extra thingatpt help-fns radix-tree help-mode cl-loaddefs
cl-lib image-mode easymenu elec-pair time-date mule-util japan-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel dos-w32 ls-lisp disp-table term/w32-win w32-win w32-vars
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript charprop case-table epa-hook jka-cmpr-hook
help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote threads w32notify w32 lcms2 multi-tty make-network-process
emacs)

Memory information:
((conses 16 120602 43655)
 (symbols 48 21333 2)
 (miscs 40 80 287)
 (strings 32 35067 1048)
 (string-bytes 1 899321)
 (vectors 16 16930)
 (vector-slots 8 597369 14750)
 (floats 8 64 249)
 (intervals 56 848 3)
 (buffers 992 21))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37580; Package emacs. (Wed, 02 Oct 2019 15:16:01 GMT)

Message #8 received at 37580 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: 37580 <at> debbugs.gnu.org
Subject: Re: bug#37580: 26.3;
 setting buffer as unibyte temporarily may change buffer contents
Date: Wed, 02 Oct 2019 18:14:43 +0300
> From: ynyaaa <at> gmail.com
> Date: Wed, 02 Oct 2019 18:43:45 +0900
> 
> 
> If a multibyte buffer contains sequences of eight-bit characters,
> evaluating the form
>  (progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
> may convert them to multibyte characters.
> 
> Afterwards, the positions recorded in buffer-undo-list may no longer
> match the buffer contents.  In the form below, undo reinserts the
> character '1' at the wrong position.

I don't think this is a bug.  Changing the multibyte-ness of a buffer
really does change the contents.  You should only do that where it
makes sense.
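
A minimal sketch of that effect, reusing the buffer contents from the
original report.  The helper name my-multibyte-round-trip-sizes is
hypothetical and not part of Emacs; it only counts characters before
and after the round trip.

```elisp
(defun my-multibyte-round-trip-sizes ()
  "Return (BEFORE AFTER) character counts around a unibyte round trip.
Hypothetical helper for illustration; not part of Emacs."
  (with-temp-buffer
    ;; U+4E00 followed by "123": four characters.
    (insert "\u4e00" "123")
    ;; Replace the first character by its three UTF-8 bytes, which a
    ;; multibyte buffer holds as three eight-bit (raw-byte) characters.
    (encode-coding-region 1 2 'utf-8)
    (let ((before (buffer-size)))
      ;; The round trip reinterprets those three bytes as one character.
      (set-buffer-multibyte nil)
      (set-buffer-multibyte t)
      (list before (buffer-size)))))

(my-multibyte-round-trip-sizes)   ; => (6 4)
```

Every position after the first character shifts by two, which is why
an undo entry recorded before the round trip lands in the wrong place.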




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37580; Package emacs. (Sat, 05 Oct 2019 17:19:02 GMT)

Message #11 received at 37580 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37580 <at> debbugs.gnu.org
Subject: Re: bug#37580: 26.3;
 setting buffer as unibyte temporarily may change buffer contents
Date: Sun, 06 Oct 2019 02:18:08 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:
> I don't think this is a bug.  Changing the multibyte-ness of a buffer
> really does change the contents.  You should only do that where it
> makes sense.

Sometimes I find broken UTF-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate
is encoded as if it were a normal BMP character.

The utf-8 coding system does not decode such sequences.
Changing multibyte-ness converts them to surrogate characters,
and an encode/decode round trip through utf-16be then yields the
intended characters.

Suppose the character is #x10000;
the corresponding surrogate pair is (#xD800 #xDC00).
The mis-encoded sequence is:
  (encode-coding-string "\xD800\xDC00" 'utf-8)
  => "\355\240\200\355\260\200"
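
For reference, the pair above follows from the standard UTF-16
arithmetic: subtract #x10000, then split the remaining 20 bits into two
10-bit halves.  A small sketch; the helper name my-surrogate-pair is
hypothetical and not part of Emacs.

```elisp
(defun my-surrogate-pair (code-point)
  "Return the UTF-16 surrogate pair for CODE-POINT as a list (HIGH LOW).
CODE-POINT must lie in the range #x10000..#x10FFFF.
Hypothetical helper for illustration; not part of Emacs."
  (let ((offset (- code-point #x10000)))
    (list (+ #xD800 (ash offset -10))          ; top 10 bits
          (+ #xDC00 (logand offset #x3FF)))))  ; low 10 bits

(my-surrogate-pair #x10000)   ; => (55296 56320), i.e. (#xD800 #xDC00)
(my-surrogate-pair #x1F600)   ; => (55357 56832), i.e. (#xD83D #xDE00)
```

Encoding each half separately as a three-byte UTF-8 sequence gives the
six bytes shown above instead of the correct four-byte sequence.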

It is not decoded with utf-8.
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
                        'utf-8)
  => "\355\240\200\355\260\200"

After changing multibyte-ness, the sequence is converted into surrogate
characters.
  (with-temp-buffer
    (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    (buffer-string))
  => "\xD800\xDC00"

The surrogate pair can be converted into the original character.
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
                        'utf-16be)
  => "\x10000"




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37580; Package emacs. (Sat, 05 Oct 2019 18:57:01 GMT)

Message #14 received at 37580 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: 37580 <at> debbugs.gnu.org
Subject: Re: bug#37580: 26.3;
 setting buffer as unibyte temporarily may change buffer contents
Date: Sat, 05 Oct 2019 21:56:36 +0300
> From: ynyaaa <at> gmail.com
> Cc: 37580 <at> debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
> 
> Sometimes I find broken UTF-8 text on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> is encoded as if it were a normal BMP character.
> 
> The utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters,
> and an encode/decode round trip through utf-16be then yields the
> intended characters.
> 
> Suppose the character is #x10000;
> the corresponding surrogate pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
>   (encode-coding-string "\xD800\xDC00" 'utf-8)
>   => "\355\240\200\355\260\200"
> 
> It is not decoded with utf-8.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                         'utf-8)
>   => "\355\240\200\355\260\200"
> 
> After changing multibyte-ness, the sequence is converted into surrogate
> characters.
>   (with-temp-buffer
>     (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>     (set-buffer-multibyte nil)
>     (set-buffer-multibyte t)
>     (buffer-string))
>   => "\xD800\xDC00"
> 
> The surrogate pair can be converted into the original character.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                         'utf-16be)
>   => "\x10000"

So where's the problem in all this?  AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.

I think the problem is that you enable undo.  So in that case, just
don't do that.
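
One way to follow that advice is to run the recovery steps quoted above
on a string inside a temporary buffer, so the undo list of the buffer
being edited is never involved.  This is only a sketch: the helper name
my-repair-surrogate-utf8 is hypothetical, it assumes the input is
otherwise valid UTF-8, and its result relies on the behavior shown in
the quoted outputs.

```elisp
(defun my-repair-surrogate-utf8 (bytes)
  "Return BYTES with UTF-8-encoded surrogate pairs combined.
BYTES is a unibyte string.  Hypothetical helper for illustration;
not part of Emacs."
  ;; `with-temp-buffer' uses a buffer whose name starts with a space,
  ;; so no undo information is recorded for it.
  (with-temp-buffer
    ;; Bytes above #x7F arrive as eight-bit (raw-byte) characters.
    (insert bytes)
    ;; The round trip turns each three-byte sequence into a surrogate.
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    ;; Passing the surrogates through utf-16be combines each pair.
    (decode-coding-string
     (encode-coding-string (buffer-string) 'utf-16be)
     'utf-16be)))

(my-repair-surrogate-utf8 "\355\240\200\355\260\200")
```

With the six bytes from the example, the call should return a string
holding the single character #x10000, matching the last quoted result.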




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37580; Package emacs. (Mon, 28 Oct 2019 23:28:01 GMT)

Message #17 received at 37580 <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37580 <at> debbugs.gnu.org, ynyaaa <at> gmail.com
Subject: Re: bug#37580: 26.3; setting buffer as unibyte temporarily may change
 buffer contents
Date: Tue, 29 Oct 2019 00:26:55 +0100
tags 37580 + notabug
close 37580
thanks

Eli Zaretskii <eliz <at> gnu.org> writes:
[...]
> So where's the problem in all this?  AFAIU, you describe a sequence of
> actions that successfully recovers text in an obscure situation.
>
> I think the problem is that you enable undo.  So in that case, just
> don't do that.

No further comments, so I'm closing this as notabug.

Best regards,
Stefan Kangas




Added tag(s) notabug. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Mon, 28 Oct 2019 23:28:04 GMT)

Bug closed; send any further explanations to 37580 <at> debbugs.gnu.org and ynyaaa <at> gmail.com. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Mon, 28 Oct 2019 23:28:04 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 26 Nov 2019 12:24:09 GMT)

This bug report was last modified 4 years and 151 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.