GNU bug report logs - #48324
27.2; hexl-mode duplicates the UTF-8 BOM

Package: emacs;

Reported by: "R. Diez" <rdiezmail-emacs <at> yahoo.de>

Date: Sun, 9 May 2021 21:39:02 UTC

Severity: normal

Found in version 27.2

Fixed in version 29.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 48324 in the body.
You can then email your comments to 48324 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 09 May 2021 21:39:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "R. Diez" <rdiezmail-emacs <at> yahoo.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 09 May 2021 21:39:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 9 May 2021 23:38:18 +0200

I think that hexl-mode has problems with the UTF-8 BOM byte sequence at the beginning of a text file. The steps to reproduce this issue are:

Create a text file with a single line with 3 characters: 123

Do a (set-buffer-file-coding-system 'utf-8-with-signature-dos) and save the file.

The file should now have the following contents (8 bytes):

ef bb bf 31 32 33 0d 0a

That is the UTF-8 BOM (ef bb bf), the ASCII digits 1, 2 and 3, and end-of-line sequence (CR LF).

Now change to hexl-mode, place the cursor at the '1' character (31 in hex), call hexl-insert-hex-char, and enter 00 in order to replace the '1' with a binary zero (NUL character).

The result is puzzling. Instead of replacing the '1' (31) with NUL (00), the UTF-8 BOM is duplicated, the characters '1', '2' and '3' have been overwritten with the new copy of the BOM, the CR character has been replaced with NUL, and the LF character is intact:

ef bb bf ef bb bf 00 0a

If you save, close and reload the file, it gains one byte, but that is probably not important, just a consequence of having lost the CR character:

ef bb bf ef bb bf 00 0d 0a
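
For reference, the expected byte layout can be checked outside Emacs. The following is a minimal Python sketch and not part of the original report; it assumes Python's "utf-8-sig" codec as a stand-in for Emacs's utf-8-with-signature, with the CR LF written out explicitly.

```python
# Minimal sketch (assumption: Python's "utf-8-sig" codec stands in for
# Emacs's utf-8-with-signature; the DOS line ending is written out by hand).
original = "123\r\n".encode("utf-8-sig")

# What the reporter expected: only the byte 0x31 at offset 3 becomes 0x00.
expected = bytearray(original)
expected[3] = 0x00

print(original.hex(" "))         # bytes of the saved file
print(bytes(expected).hex(" "))  # bytes after the intended edit
```

The first line printed should match the 8-byte dump in the report; the second is the result the reporter was expecting instead of the duplicated BOM.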

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 14:18:02 GMT) Full text and rfc822 format available.

Message #8 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
Cc: 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 17:17:49 +0300

> Date: Sun, 9 May 2021 23:38:18 +0200
> From:  "R. Diez" via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> I think that hexl-mode has problems with the UTF-8 BOM byte sequence at the beginning of a text file. The steps to reproduce this issue are:
> 
> Create a text file with a single line with 3 characters: 123
> 
> Do a (set-buffer-file-coding-system 'utf-8-with-signature-dos) and save the file.
> 
> The file should now have the following contents (8 bytes):
> 
> ef bb bf 31 32 33 0d 0a
> 
> That is the UTF-8 BOM (ef bb bf), the ASCII digits 1, 2 and 3, and end-of-line sequence (CR LF).
> 
> Now change to hexl-mode, place the cursor at the '1' character (31 in hex), call hexl-insert-hex-char, and enter 00 in order to replace the '1' with a binary zero (NUL character).
> 
> The result is puzzling. Instead of replacing the '1' (31) with NUL (00), the UTF-8 BOM is duplicated, the characters '1', '2' and '3' have been overwritten with the new copy of the BOM, the CR character has been replaced with NUL, and the LF character is intact:
> 
> ef bb bf ef bb bf 00 0a
> 
> If you save, close and reload the file, it gains one byte, but that is probably not important, just a consequence of having lost the CR character:
> 
> ef bb bf ef bb bf 00 0d 0a

I cannot reproduce this.  Are you sure you are using the hexl executable
which came with Emacs 27.2 and not some older/incompatible version?
Are you sure your hexl.el is the one which came with Emacs 27.2?

And on what OS is this (you have omitted all the information collected
by report-emacs-bug, so I cannot know that)?

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 16:14:01 GMT) Full text and rfc822 format available.

Message #11 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
Cc: 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 19:13:18 +0300

[Please use Reply All to keep the bug address on the CC list.]

> From: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
> Date: Mon, 10 May 2021 16:36:45 +0200
> 
> > I cannot reproduce this.  Are you sure you are using the hexl executable
> > which came with Emacs 27.2 and not some older/incompatible version?
> > Are you sure your hexl.el is the one which came with Emacs 27.2?
> 
> I am running Ubuntu MATE 20.04.2, but I built Emacs myself.
> 
> When I ask for help on hexl-mode and follow the link, I end up in this file:
> 
> ~/rdiez/LocalSoftware/Emacs/emacs-27.2-bin/share/emacs/27.2/lisp/hexl.el.gz
> 
> There is no hexl executable on the PATH as far as I can tell, but there is one here:
> 
> /home/rdiez/rdiez/LocalSoftware/Emacs/emacs-27.2-bin/libexec/emacs/27.2/x86_64-pc-linux-gnu/hexl

Strange.  All of the above sounds fine, and yet I cannot reproduce the
problem here.

Is anyone else able to reproduce it?

> The full system information is:
> 
> In GNU Emacs 27.2 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.20, cairo version 1.16.0)
>   of 2021-05-08 built on rdiez4
> Windowing system distributor 'The X.Org Foundation', version 11.0.12009000
> System Description: Ubuntu 20.04.2 LTS
> 
> Recent messages:
> Mark set
> Mark saved where search started
> Mark set
> Making completion list... [2 times]
> Quit [2 times]
> user-error: No window up from selected window
> user-error: You didn’t specify a function symbol
> Type C-x 1 to delete the help window, C-M-v to scroll help.
> Mark set [3 times]
> Making completion list...
> 
> Configured using:
>   'configure 'CFLAGS=-g3 -O3 -march=native -flto' --with-x-toolkit=gtk3
>   --with-cairo --with-xwidgets
>   --prefix=/home/rdiez/rdiez/LocalSoftware/Emacs/emacs-27.2-bin'
> 
> Configured features:
> XPM JPEG TIFF GIF PNG RSVG CAIRO SOUND GPM DBUS GSETTINGS GLIB NOTIFY
> INOTIFY ACL LIBSELINUX GNUTLS LIBXML2 FREETYPE HARFBUZZ M17N_FLT LIBOTF
> ZLIB TOOLKIT_SCROLL_BARS GTK3 X11 XDBE XIM MODULES THREADS XWIDGETS
> LIBSYSTEMD JSON PDUMPER LCMS2 GMP
> 
> Important settings:
>    value of $LC_MONETARY: de_DE.UTF-8
>    value of $LC_NUMERIC: de_DE.UTF-8
>    value of $LC_TIME: de_DE.UTF-8
>    value of $LANG: en_US.UTF-8
>    locale-coding-system: utf-8-unix
> 
> Major mode: Term
> 
> Minor modes in effect:
>    hexl-follow-ascii: t
>    global-undo-tree-mode: t
>    save-place-mode: t
>    which-key-mode: t
>    hes-mode: t
>    tabbar-mwheel-mode: t
>    tabbar-mode: t
>    shell-dirtrack-mode: t
>    recentf-mode: t
>    xterm-mouse-mode: t
>    savehist-mode: t
>    dtrt-indent-global-mode: t
>    override-global-mode: t
>    delete-selection-mode: t
>    show-paren-mode: t
>    tooltip-mode: t
>    global-eldoc-mode: t
>    electric-indent-mode: t
>    mouse-wheel-mode: t
>    menu-bar-mode: t
>    file-name-shadow-mode: t
>    global-font-lock-mode: t
>    font-lock-mode: t
>    blink-cursor-mode: t
>    auto-composition-mode: t
>    auto-encryption-mode: t
>    auto-compression-mode: t
>    buffer-read-only: t
>    column-number-mode: t
>    line-number-mode: t
> 
> Load-path shadows:
> None found.
> 
> Features:
> (shadow sort mail-extr emacsbug message rmc puny rfc822 mml mml-sec epa
> derived epg epg-config gnus-util rmail rmail-loaddefs
> text-property-search mm-decode mm-bodies mm-encode mail-parse rfc2231
> mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
> mm-util mail-prsvr mail-utils jka-compr eieio-opt speedbar sb-image
> ezimage dframe find-func help-fns radix-tree hexl texinfo dired-aux
> find-dired ffap thingatpt grep misearch multi-isearch pp vc-git
> diff-mode perl-mode bm cc-mode cc-fonts cc-guess cc-menus cc-cmds etags
> fileloop generator xref project dired-single undo-tree diff iso-transl
> multi-term saveplace which-key highlight-escape-sequences cc-styles
> cc-align cc-engine cc-vars cc-defs dired dired-loaddefs compile term
> disp-table ehelp tabbar tab-line tramp-cache tramp-sh tramp
> tramp-loaddefs trampver tramp-integration files-x tramp-compat shell
> pcomplete comint ansi-color parse-time iso8601 time-date ls-lisp
> format-spec recentf tree-widget xt-mouse savehist auto-package-update
> dash paradox paradox-menu paradox-commit-list hydra ring lv cus-edit
> wid-edit paradox-execute paradox-github paradox-core spinner pod-mode
> edmacro kmacro cl dtrt-indent advice cl-extra help-mode ascii server
> windmove diminish use-package use-package-ensure use-package-delight
> use-package-diminish use-package-bind-key bind-key easy-mmode
> use-package-core finder-inf delsel paren display-fill-column-indicator
> cua-base cus-start cus-load info package easymenu browse-url
> url-handlers url-parse auth-source cl-seq eieio eieio-core cl-macs
> eieio-loaddefs password-cache json subr-x map url-vars seq byte-opt gv
> bytecomp byte-compile cconv cl-loaddefs cl-lib tooltip eldoc electric
> uniquify ediff-hook vc-hooks lisp-float-type mwheel term/x-win x-win
> term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
> tabulated-list replace newcomment text-mode elisp-mode lisp-mode
> prog-mode register page tab-bar menu-bar rfn-eshadow isearch timer
> select scroll-bar mouse jit-lock font-lock syntax facemenu font-core
> term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang
> misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
> cp51932 hebrew greek romanian slovak czech european ethiopic indian
> cyrillic chinese composite charscript charprop case-table epa-hook
> jka-cmpr-hook help simple abbrev obarray cl-preloaded nadvice loaddefs
> button faces cus-face macroexp files text-properties overlay sha1 md5
> base64 format env code-pages mule custom widget hashtable-print-readable
> backquote threads dbusbind inotify lcms2 dynamic-setting
> system-font-setting font-render-setting xwidget-internal cairo
> move-toolbar gtk x-toolkit x multi-tty make-network-process emacs)
> 
> Memory information:
> ((conses 16 455117 44468)
>   (symbols 48 25447 5)
>   (strings 32 111754 4403)
>   (string-bytes 1 3303255)
>   (vectors 16 43189)
>   (vector-slots 8 1337820 193116)
>   (floats 8 264 219)
>   (intervals 56 17564 0)
>   (buffers 1000 30))
> 
> Regards,
>    rdiez
> 
>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 16:30:02 GMT) Full text and rfc822 format available.

Message #14 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: "R. Diez" <rdiezmail-emacs <at> yahoo.de>, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 18:28:50 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Is anyone else able to reproduce it?

Yes, the recipe reproduces fine here (Debian/bullseye on the trunk).
Before:

[Screenshot (image/png attachment) not reproduced here]

Then inserting 00 on the 31:

[Screenshot (image/png attachment) not reproduced here]

Doubled UTF-8 BOM, and then 00 over the 0d instead of the 31.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 16:51:02 GMT) Full text and rfc822 format available.

Message #17 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: "R. Diez" <rdiezmail-emacs <at> yahoo.de>, Eli Zaretskii <eliz <at> gnu.org>,
 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 18:50:40 +0200

On May 10 2021, Lars Ingebrigtsen wrote:

> Doubled UTF-8 BOM, and then 00 over the 0d instead of the 31.

That only happens when you call hexl-mode with the decoded file
contents.  With hexl-find-file it doesn't happen, presumably because it
doesn't decode the file contents.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 17:07:02 GMT) Full text and rfc822 format available.

Message #20 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rdiezmail-emacs <at> yahoo.de, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 20:06:18 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: "R. Diez" <rdiezmail-emacs <at> yahoo.de>,  48324 <at> debbugs.gnu.org
> Date: Mon, 10 May 2021 18:28:50 +0200
> 
> > Is anyone else able to reproduce it?
> 
> Yes, the recipe reproduces fine here (Debian/bullseye on the trunk).

Then I guess you or someone else will have to debug that.  Since the
OS upgrade on fencepost, I cannot run Emacs there, and cannot build a
new one.  I have no idea when this will be fixed (sysadmin for now
thinks my request is not valid), but until then I'm limited to what I
see on Windows.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 17:17:01 GMT) Full text and rfc822 format available.

Message #23 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: rdiezmail-emacs <at> yahoo.de, larsi <at> gnus.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 20:16:11 +0300

> From: Andreas Schwab <schwab <at> linux-m68k.org>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  "R. Diez" <rdiezmail-emacs <at> yahoo.de>,
>   48324 <at> debbugs.gnu.org
> Date: Mon, 10 May 2021 18:50:40 +0200
> 
> On May 10 2021, Lars Ingebrigtsen wrote:
> 
> > Doubled UTF-8 BOM, and then 00 over the 0d instead of the 31.
> 
> That only happens when you call hexl-mode with the decoded file
> contents.  With hexl-find-file it doesn't happen, presumably because it
> doesn't decode the file contents.

Ah, so maybe I didn't use the exact recipe.  I did try hexl-mode as
well as hexl-find-file, but maybe I missed something.  Could someone
please post an exact recipe, step by step?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 17:44:01 GMT) Full text and rfc822 format available.

Message #26 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, Andreas Schwab <schwab <at> linux-m68k.org>,
 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 19:43:20 +0200

> Ah, so maybe I didn't use the exact recipe.  I did try hexl-mode as
> well as hexl-find-file, but maybe I missed something.  Could someone
> please post an exact recipe, step by step?

I'll try again:

- I created an empty file with Caja (the MATE Desktop file manager) named Test7.txt. That empty file is 0 bytes long.
- I then dragged the file to Emacs in order to open it.
The default encoding is utf-8-unix (visible on Emacs' status line).
- I pressed my keyboard shortcut for (eval-expression).
- I changed the encoding by manually evaluating this expression:
(set-buffer-file-coding-system 'utf-8-with-signature-dos)
- I then typed in the buffer for Test7.txt the characters "123".
- I saved the buffer with menu "File", option "Save".
- I ran in the minibuffer command hexl-mode, which gives me the hex view for that file:
ef bb bf 31 32 33 0d 0a
- I moved the cursor with the arrow keys to the byte with value "31".
- I ran in the minibuffer command hexl-insert-hex-char, in order to overwrite the 31 with a new value.
- I typed in the minibuffer the hex value "00" (a binary null) and pressed enter.
- In the hex view, the BOM is now duplicated.

Best regards,
  rdiez

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 17:52:01 GMT) Full text and rfc822 format available.

Message #29 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
Cc: larsi <at> gnus.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 20:51:38 +0300

> Cc: larsi <at> gnus.org, 48324 <at> debbugs.gnu.org,
>  Andreas Schwab <schwab <at> linux-m68k.org>
> From: "R. Diez" <rdiezmail-emacs <at> yahoo.de>
> Date: Mon, 10 May 2021 19:43:20 +0200
> 
> - I created an empty file with Caja (the MATE Desktop file manager) named Test7.txt. That empty file is 0 bytes long.
> - I then dragged the file to Emacs in order to open it.
> The default encoding is utf-8-unix (visible on Emacs' status line).
> - I pressed my keyboard shortcut for (eval-expression).
> - I changed the encoding by manually evaluating this expression:
> (set-buffer-file-coding-system 'utf-8-with-signature-dos)
> - I then typed in the buffer for Test7.txt the characters "123".
> - I saved the buffer with menu "File", option "Save".
> - I ran in the minibuffer command hexl-mode, which gives me the hex view for that file:
> ef bb bf 31 32 33 0d 0a
> - I moved the cursor with the arrow keys to the byte with value "31".
> - I ran in the minibuffer command hexl-insert-hex-char, in order to overwrite the 31 with a new value.
> - I typed in the minibuffer the hex value "00" (a binary null) and pressed enter.
> - In the hex view, the BOM is now duplicated.

Thanks, I see it now.

FTR, here's a shorter and easier recipe:

  emacs -Q
  C-x C-f foo.txt RET
  C-x RET f utf-8-with-signature-dos RET
  1 2 3
  C-x C-s
  M-x hexl-mode RET
  M-x hexl-insert-hex-char RET 00 RET

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 10 May 2021 18:06:02 GMT) Full text and rfc822 format available.

Message #32 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: "R. Diez" <rdiezmail-emacs <at> yahoo.de>, larsi <at> gnus.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 10 May 2021 20:05:33 +0200

On May 10 2021, Eli Zaretskii wrote:

> FTR, here's a shorter and easier recipe:
>
>   emacs -Q
>   C-x C-f foo.txt RET
>   C-x RET f utf-8-with-signature-dos RET
>   1 2 3
>   C-x C-s
>   M-x hexl-mode RET
>   M-x hexl-insert-hex-char RET 00 RET

I guess the gist is that hexl-mode not only needs to account for the EOL
type, but also for the signature when computing original-point.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Tue, 11 May 2021 12:05:02 GMT) Full text and rfc822 format available.

Message #35 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: rdiezmail-emacs <at> yahoo.de, larsi <at> gnus.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Tue, 11 May 2021 15:04:05 +0300

> From: Andreas Schwab <schwab <at> linux-m68k.org>
> Cc: "R. Diez" <rdiezmail-emacs <at> yahoo.de>,  larsi <at> gnus.org,
>   48324 <at> debbugs.gnu.org
> Date: Mon, 10 May 2021 20:05:33 +0200
> 
> On May 10 2021, Eli Zaretskii wrote:
> 
> > FTR, here's a shorter and easier recipe:
> >
> >   emacs -Q
> >   C-x C-f foo.txt RET
> >   C-x RET f utf-8-with-signature-dos RET
> >   1 2 3
> >   C-x C-s
> >   M-x hexl-mode RET
> >   M-x hexl-insert-hex-char RET 00 RET
> 
> I guess the gist is that hexl-mode not only needs to account for the EOL
> type, but also for the signature when computing original-point.

Actually, it turned out that wasn't the main problem.  (It was still a
problem, but the same problem happened in a buffer produced by
hexl-find-file.)  The main problems were that (a) hexl.el handled null
bytes as characters that need to be encoded before inserting them (as
if they were non-ASCII characters), and (b) its handling of non-ASCII
characters when the encoding of the original file used a BOM was
incorrect (because encode-coding-char didn't remove the BOM from the
encoded byte sequence).  By contrast, hexl-find-file visits the file
literally, so its encoding of a null byte was trivially correct.

This should now be fixed on the master branch.

The capability of inserting multibyte characters via Hexl is somewhat
problematic, so I made a point of describing the issues in the
relevant doc strings (because the problems are intrinsic and IMO hard
or impossible to solve in general).
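
The byte pattern in the original report is consistent with point (b) above: if the single byte to insert is passed through an encoder that prepends a signature, four bytes are written instead of one. The following is a Python analogy only, not the hexl.el code; "utf-8-sig" stands in for the BOM-producing coding system, and `overwrite` is a hypothetical helper introduced for illustration.

```python
# Python analogy only; this is not the hexl.el code. "utf-8-sig" stands in
# for a coding system whose encoder prepends the UTF-8 BOM.
def overwrite(buf: bytes, offset: int, new: bytes) -> bytes:
    """Hypothetical helper: overwrite len(new) bytes of buf at offset."""
    return buf[:offset] + new + buf[offset + len(new):]

original = "123\r\n".encode("utf-8-sig")   # BOM, '1', '2', '3', CR, LF

# Encoding NUL through a BOM-producing codec yields 4 bytes, not 1.
encoded_nul = "\x00".encode("utf-8-sig")

buggy = overwrite(original, 3, encoded_nul)  # encoded sequence written over 4 bytes
fixed = overwrite(original, 3, b"\x00")      # raw byte written over 1 byte

print(buggy.hex(" "))
print(fixed.hex(" "))
```

Under these assumptions the "buggy" path reproduces the dump from the report (second BOM over 31 32 33, then 00 over 0d), while writing the raw byte leaves the rest of the file untouched.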

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Tue, 11 May 2021 20:38:02 GMT) Full text and rfc822 format available.

Message #38 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rdiezmail-emacs <at> yahoo.de, larsi <at> gnus.org,
 Andreas Schwab <schwab <at> linux-m68k.org>, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Tue, 11 May 2021 16:37:51 -0400

Eli Zaretskii wrote:

> This should now be fixed on the master branch.

The change to encode-coding-char in f3f1947e5b5b causes
test subr-string-limit-coding to fail. Ref eg
https://hydra.nixos.org/build/142879118

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Wed, 12 May 2021 13:51:01 GMT) Full text and rfc822 format available.

Message #41 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>, Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Wed, 12 May 2021 16:50:15 +0300

> From: Glenn Morris <rgm <at> gnu.org>
> Cc: Andreas Schwab <schwab <at> linux-m68k.org>,  48324 <at> debbugs.gnu.org,  rdiezmail-emacs <at> yahoo.de,  larsi <at> gnus.org
> Date: Tue, 11 May 2021 16:37:51 -0400
> 
> Eli Zaretskii wrote:
> 
> > This should now be fixed on the master branch.
> 
> The change to encode-coding-char in f3f1947e5b5b causes
> test subr-string-limit-coding to fail. Ref eg
> https://hydra.nixos.org/build/142879118

Thanks, I fixed that.

The original test results seemed strange, to say the least: it's as if
we shoot first and draw the target later so that it fits.  E.g., how
can the last 4 bytes of encoding "foóá" with UTF-16 be
"\376\377\000\341", with the 2 first bytes coming from the BOM?

This actually reveals a design flaw in string-limit: we cannot simply
use encode-coding-char to encode the characters one by one.  I added a
FIXME comment to explain why, as I don't currently have any clever
ideas for how to implement it more correctly, except by iterations,
which is inelegant.  Ideas welcome.
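
The effect described here can be illustrated outside Emacs. This is a Python sketch, not the string-limit implementation; `encode_with_bom` is a hypothetical stand-in for a BOM-producing big-endian UTF-16 encoder, chosen to match the byte order of the quoted octal string.

```python
import codecs

def encode_with_bom(text: str) -> bytes:
    """Hypothetical stand-in for a BOM-producing UTF-16 (big-endian) encoder."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

text = "fo\u00f3\u00e1"  # the string "foóá" from the test

whole = encode_with_bom(text)                            # one BOM up front
per_char = b"".join(encode_with_bom(ch) for ch in text)  # one BOM per character

print(len(whole), len(per_char))
print(per_char[-4:])  # the last character arrives with its own BOM in front
```

Encoding character by character puts a BOM in front of every character, so the "last 4 bytes" are a BOM plus the final character rather than the tail of a correctly encoded string.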

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sat, 02 Jul 2022 16:15:02 GMT) Full text and rfc822 format available.

Message #44 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Glenn Morris <rgm <at> gnu.org>, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sat, 02 Jul 2022 18:14:39 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> This actually reveals a design flaw in string-limit: we cannot simply
> use encode-coding-char to encode the characters one by one.  I added a
> FIXME comment to explain why, as I don't currently have any clever
> ideas for how to implement it more correctly, except by iterations,
> which is inelegant.  Ideas welcome.

Hm...  do we have some way of knowing that the coding system we're using
is one that should have a BOM?  And a function to remove the BOM?

If we had both, then we could strip the BOM from the individual chars,
and add one to the front.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sat, 02 Jul 2022 16:38:01 GMT) Full text and rfc822 format available.

Message #47 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sat, 02 Jul 2022 19:37:07 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: Glenn Morris <rgm <at> gnu.org>,  schwab <at> linux-m68k.org,  48324 <at> debbugs.gnu.org
> Date: Sat, 02 Jul 2022 18:14:39 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > This actually reveals a design flaw in string-limit: we cannot simply
> > use encode-coding-char to encode the characters one by one.  I added a
> > FIXME comment to explain why, as I don't currently have any clever
> > ideas for how to implement it more correctly, except by iterations,
> > which is inelegant.  Ideas welcome.
> 
> Hm...  do we have some way of knowing that the coding system we're using
> is one that should have a BOM?  And a function to remove the BOM?

The problem is not just with BOM.  The problem will happen with any
coding-system that produces prefix and/or suffix bytes when it encodes
strings.  The FIXME I added mentions ISO-2022 7-bit encodings as
another example.

And then there are coding systems with pre-write-conversion, and
those can produce any additions they like.

> If we had both, then we could strip the BOM from the individual chars,
> and add one to the front.

AFAIR, what we have now already handles BOM in coding systems that
are known to produce a BOM.  See encode-coding-char.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 11:09:02 GMT) Full text and rfc822 format available.

Message #50 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 13:08:04 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> The problem is not just with BOM.  The problem will happen with any
> coding-system that produces prefix and/or suffix bytes when it encodes
> strings.  The FIXME I added mentions ISO-2022 7-bit encodings as
> another example.
>
> And then there are coding systems with pre-write-conversion, and
> those can produce any additions they like.
>
>> If we had both, then we could strip the BOM from the individual chars,
>> and add one to the front.
>
> AFAIR, what we have now already handles BOM in coding systems that
> are known to produce a BOM.  See encode-coding-char.

Ah, OK, it uses (coding-system-get coding-system :bom) and then
special-cases utf-8 and -16 to remove the BOM.

Hm...  I guess the only reliable solution across all coding systems is
(like your comment in the code says) to drop the encode-every-char and
try encoding strings, and then see whether the result is short enough.
That could be done somewhat efficiently using a binary search.  I'll
have a go at it...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 12:09:02 GMT) Full text and rfc822 format available.

Message #53 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 14:07:43 +0200

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> Hm...  I guess the only reliable solution across all coding systems is
> (like your comment in the code says) to drop the encode-every-char and
> try encoding strings, and then see whether the result is short enough.
> That could be done somewhat efficiently using a binary search.  I'll
> have a go at it...

And while I was at it, I changed it to return complete glyphs, not just
complete code points.

There's a behavioural change, though.  This: 

(string-limit "foóá" 6 t 'utf-16)

Now returns a string with a BOM, whereas previously it didn't.  I think
that's what callers would want, though (the use case here is really
IRC -- you have to limit the max encoded length, but I think if you're
talking utf-16, you want the BOM).

But it's debatable.
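
The approach of encoding whole prefixes and searching for the longest one that fits can be sketched as follows. This is Python, not the Emacs Lisp that was committed; `limit_encoded` and `utf16_with_bom` are hypothetical names, the sketch works on code points rather than glyphs, and it assumes the encoded length never shrinks as the prefix grows.

```python
import codecs
from typing import Callable

def utf16_with_bom(text: str) -> bytes:
    """Hypothetical stand-in for a BOM-producing UTF-16 (big-endian) encoder."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

def limit_encoded(text: str, max_bytes: int,
                  encode: Callable[[str], bytes]) -> bytes:
    """Return the encoding of the longest prefix of text that fits in max_bytes.

    Assumes len(encode(prefix)) is non-decreasing in the prefix length.
    If even the empty prefix does not fit, encode("") is still returned.
    """
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2   # round up so the loop always makes progress
        if len(encode(text[:mid])) <= max_bytes:
            lo = mid               # this prefix fits; try a longer one
        else:
            hi = mid - 1           # too long; try a shorter one
    return encode(text[:lo])

result = limit_encoded("fo\u00f3\u00e1", 6, utf16_with_bom)
print(result)
```

In this sketch the 6-byte budget is spent on the BOM plus two characters, which is the behaviour under discussion: the limit counts the BOM as part of the length.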

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug marked as fixed in version 29.1, send any further explanations to 48324 <at> debbugs.gnu.org and "R. Diez" <rdiezmail-emacs <at> yahoo.de> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 03 Jul 2022 12:09:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 13:02:02 GMT) Full text and rfc822 format available.

Message #58 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 16:00:47 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  schwab <at> linux-m68k.org,  48324 <at> debbugs.gnu.org
> Date: Sun, 03 Jul 2022 14:07:43 +0200
> 
> Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> 
> > Hm...  I guess the only reliable solution across all coding systems is
> > (like your comment in the code says) to drop the encode-every-char and
> > try encoding strings, and then see whether the result is short enough.
> > That could be done somewhat efficiently using a binary search.  I'll
> > have a go at it...
> 
> And while I was at it, I changed it to return complete glyphs, not just
> complete code points.
> 
> There's a behavioural change, though.  This: 
> 
> (string-limit "foóá" 6 t 'utf-16)
> 
> Now returns a string with a BOM, whereas previously it didn't.

So you get 6 characters + the BOM?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 13:28:02 GMT) Full text and rfc822 format available.

Message #61 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: larsi <at> gnus.org
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 16:26:54 +0300

> Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
> Date: Sun, 03 Jul 2022 16:00:47 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
> > From: Lars Ingebrigtsen <larsi <at> gnus.org>
> > Cc: rgm <at> gnu.org,  schwab <at> linux-m68k.org,  48324 <at> debbugs.gnu.org
> > Date: Sun, 03 Jul 2022 14:07:43 +0200
> > 
> > Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> > 
> > > Hm...  I guess the only reliable solution across all coding systems is
> > > (like your comment in the code says) to drop the encode-every-char and
> > > try encoding strings, and then see whether the result is short enough.
> > > That could be done somewhat efficiently using a binary search.  I'll
> > > have a go at it...
> > 
> > And while I was at it, I changed it to return complete glyphs, not just
> > complete code points.
> > 
> > There's a behavioural change, though.  This: 
> > 
> > (string-limit "foóá" 6 t 'utf-16)
> > 
> > Now returns a string with a BOM, whereas previously it didn't.
> 
> So you get 6 characters + the BOM?

I see that it's actually 6 bytes _including_ the BOM.  So I think this
is confusing: if we are going to return a string with the BOM, we
should not count the BOM as part of the LENGTH bytes.  Because if I
requested to get characters which fit into N bytes, I should get those
N bytes of payload.  Or maybe we should have an optional argument to
control whether LENGTH includes or excludes the BOM.

In any case, we should mention this aspect in the doc string, I think.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 13:29:02 GMT) Full text and rfc822 format available.

Message #64 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 15:28:32 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> There's a behavioural change, though.  This: 
>> 
>> (string-limit "foóá" 6 t 'utf-16)
>> 
>> Now returns a string with a BOM, whereas previously it didn't.
>
> So you get 6 characters + the BOM?

Two characters and the BOM (i.e., six bytes).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
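
The byte arithmetic behind "two characters and the BOM" can be checked with a small sketch. `my-encoded-length` is a hypothetical helper, and the example assumes the precomposed characters U+00F3 and U+00E1, each of which takes two bytes in UTF-16.

```elisp
(defun my-encoded-length (string coding-system)
  "Return the number of bytes STRING occupies when encoded in CODING-SYSTEM.
Hypothetical helper for illustration."
  ;; `encode-coding-string' returns a unibyte string, so `length'
  ;; counts bytes here, not characters.
  (length (encode-coding-string string coding-system)))

;; The last two characters of "foóá", encoded with a BOM:
(my-encoded-length "óá" 'utf-16)      ; => 6 (2-byte BOM + 2 * 2 bytes)

;; The whole string would not fit into a 6-byte limit:
(my-encoded-length "foóá" 'utf-16)    ; => 10 (2-byte BOM + 4 * 2 bytes)
```

According to this thread, on Emacs 29.1 or later `(string-limit "foóá" 6 t 'utf-16)` returns those same six bytes.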

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 13:49:01 GMT) Full text and rfc822 format available.

Message #67 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, larsi <at> gnus.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 15:48:54 +0200

On Jul 03 2022, Eli Zaretskii wrote:

> Or maybe we should have an optional argument to control whether LENGTH
> includes or excludes the BOM.

utf-8-with-signature?

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Sun, 03 Jul 2022 13:53:02 GMT) Full text and rfc822 format available.

Message #70 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: rgm <at> gnu.org, larsi <at> gnus.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 16:51:46 +0300

> From: Andreas Schwab <schwab <at> linux-m68k.org>
> Cc: larsi <at> gnus.org,  rgm <at> gnu.org,  48324 <at> debbugs.gnu.org
> Date: Sun, 03 Jul 2022 15:48:54 +0200
> 
> On Jul 03 2022, Eli Zaretskii wrote:
> 
> > Or maybe we should have an optional argument to control whether LENGTH
> > includes or excludes the BOM.
> 
> utf-8-with-signature?

No, I mean when the CODING-SYSTEM argument requires a BOM (or
shift-in and shift-out sequences).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 04 Jul 2022 10:35:01 GMT) Full text and rfc822 format available.

Message #73 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 04 Jul 2022 12:34:29 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> I see that it's actually 6 bytes _including_ the BOM.  So I think this
> is confusing: if we are going to return a string with the BOM, we
> should not count the BOM as part of the LENGTH bytes.  Because if I
> requested to get characters which fit into N bytes, I should get those
> N bytes of payload.  Or maybe we should have an optional argument to
> control whether LENGTH includes or excludes the BOM.

If the caller has asked for a max number of bytes in a coding system
that includes a BOM, then the BOM has to be counted -- otherwise the
bytes won't fit into whatever field the protocol they're using limits
the string to.

However, utf-16 is in a slightly special situation here, since the byte
order is often implied, and people use utf-16 instead of
utf-16be-with-signature (or something), and utf-16 (in Emacs) is defined
to have a BOM.  (And we don't have a -without-signature variant, do we?)

> In any case, we should mention this aspect in the doc string, I think.

Yes.  But should we have -without-signature variants for utf-16?  Then
the doc string could recommend using that if the caller wants BOM-less
bytes.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Mon, 04 Jul 2022 11:32:01 GMT) Full text and rfc822 format available.

Message #76 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 04 Jul 2022 14:31:01 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  schwab <at> linux-m68k.org,  48324 <at> debbugs.gnu.org
> Date: Mon, 04 Jul 2022 12:34:29 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > I see that it's actually 6 bytes _including_ the BOM.  So I think this
> > is confusing: if we are going to return a string with the BOM, we
> > should not count the BOM as part of the LENGTH bytes.  Because if I
> > requested to get characters which fit into N bytes, I should get those
> > N bytes of payload.  Or maybe we should have an optional argument to
> > control whether LENGTH includes or excludes the BOM.
> 
> If the caller has asked for a max number of bytes in a coding system
> that includes a BOM, then the BOM has to be counted -- otherwise the
> bytes won't fit into whatever field the protocol they're using limits
> the string to.

You obviously have a very specific use case in mind.  But there are
others.  Moreover, UTF and BOM is a special case, where the prefix is
known in advance.  Other encodings, notably from the ISO-2022 family,
are harder because the exact shift-in sequence is not always easy to
guess.

Which is why I thought a way to control this aspect could be needed.
But we could just document the subtlety and wait for someone to come
up with a practical scenario where it would be needed.

> (And we don't have a -without-signature variant, do we?)

We do: utf-16le and utf-16be.

> > In any case, we should mention this aspect in the doc string, I think.
> 
> Yes.  But should we have -without-signature variants for utf-16?  Then
> the doc string could recommend using that if the caller wants BOM-less
> bytes.

See above.
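
For callers who want BOM-less bytes, the difference between `utf-16` and the `utf-16be` / `utf-16le` variants mentioned above can be checked with a sketch like this one. `my-starts-with-bom-p` is a hypothetical helper written for this illustration.

```elisp
(defun my-starts-with-bom-p (bytes)
  "Return t if the unibyte string BYTES starts with a UTF-16 BOM.
Hypothetical helper; it looks for FE FF (big endian) or FF FE
\(little endian) in the first two bytes."
  (and (>= (length bytes) 2)
       (member (list (aref bytes 0) (aref bytes 1))
               '((#xfe #xff) (#xff #xfe)))
       t))

(my-starts-with-bom-p (encode-coding-string "fo" 'utf-16))    ; => t
(my-starts-with-bom-p (encode-coding-string "fo" 'utf-16be))  ; => nil
(my-starts-with-bom-p (encode-coding-string "fo" 'utf-16le))  ; => nil
```

With one of the BOM-less variants, the whole LENGTH budget is available for payload bytes.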

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48324; Package emacs. (Tue, 05 Jul 2022 11:09:01 GMT) Full text and rfc822 format available.

Message #79 received at 48324 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: Re: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Tue, 05 Jul 2022 13:08:08 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> You obviously have a very specific use case in mind.  But there are
> others.

I don't see any use case for requesting a specific number of bytes
other than some restriction on how that selection of bytes will be
used.

>> (And we don't have a -without-signature variant, do we?)
>
> We do: utf-16le and utf-16be.

I've now mentioned this in the doc string.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 02 Aug 2022 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 267 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
