GNU bug report logs - #60750
29.0.60; encode-coding-char fails for utf-8-auto coding system

Package: emacs;

Reported by: Robert Pluim <rpluim <at> gmail.com>

Date: Thu, 12 Jan 2023 09:09:02 UTC

Severity: normal

Found in version 29.0.60

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 60750 in the body.
You can then email your comments to 60750 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#60750; Package emacs. (Thu, 12 Jan 2023 09:09:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Robert Pluim <rpluim <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 12 Jan 2023 09:09:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.0.60; encode-coding-char fails for utf-8-auto coding system
Date: Thu, 12 Jan 2023 10:08:31 +0100
src/emacs -Q
M-x toggle-debug-on-error
M-: (setq buffer-file-coding-system 'utf-8-auto)
C-b
C-u C-x =

=>
Debugger entered--Lisp error: (args-out-of-range "))" 3 1)
  encode-coding-char(41 utf-8-auto ascii)
  describe-char(189)
  what-cursor-position((4))
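
The backtrace ends in `encode-coding-char(41 utf-8-auto ascii)`, so the recipe can probably be reduced to a direct call from Lisp.  A minimal sketch, assuming a build that still has the problem; `bug60750-try-encode' is a hypothetical helper name used only for illustration, not part of Emacs:

```elisp
;; Hypothetical helper (not part of Emacs): call `encode-coding-char'
;; and return the error data instead of entering the debugger.
(defun bug60750-try-encode (char coding-system)
  "Return the encoding of CHAR in CODING-SYSTEM, or (failed . ERROR-DATA)."
  (condition-case err
      (encode-coding-char char coding-system)
    (args-out-of-range (cons 'failed err))))

;; 41 is the character code of ?\), as in the backtrace.  On an
;; affected build this is expected to return the `failed' form.
(bug60750-try-encode ?\) 'utf-8-auto)

;; Plain utf-8 has no BOM to skip, so this returns ")".
(bug60750-try-encode ?\) 'utf-8)
```

Comparing the two calls separates a problem specific to `utf-8-auto' from a general problem in `encode-coding-char'.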

This is because utf-8-auto has a non-nil :bom property:

(define-coding-system 'utf-8-auto
  "UTF-8 (auto-detect signature (BOM))"
  :coding-type 'utf-8
  :mnemonic ?U
  :charset-list '(unicode)
  :bom '(utf-8-with-signature . utf-8))

and `encode-coding-char' does this:

        ;; We also need to exclude the leading 2 or 3 bytes if they
        ;; come from a BOM.
        (setq i0
              (if bom-p
                  (cond
                   ((eq (coding-system-type coding-system) 'utf-8)
                    3)
                   ((eq (coding-system-type coding-system) 'utf-16)
                    2)
                   (t 0))
                0))
        (substring enc2 i0 i2)))))
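
To make the failure mode concrete: the error data in the backtrace, (args-out-of-range "))" 3 1), is what `substring' signals when the start index (3, the assumed BOM length) is larger than the end index (1).  That suggests the encoded two-character string was just "))" with no BOM bytes in front of it.  A small sketch using only built-in functions; the comments about which value ends up in which variable are a reading of the snippet above, not a verified trace:

```elisp
;; The two properties the snippet above branches on.
(coding-system-get 'utf-8-auto :bom)   ; the cons cell from the definition
(coding-system-type 'utf-8-auto)       ; utf-8, so i0 becomes 3

;; With no BOM actually emitted, enc2 is "))" and i2 is 1, so the final
;; form amounts to this call, which signals the reported error.
(condition-case err
    (substring "))" 3 1)
  (args-out-of-range err))
;; => (args-out-of-range "))" 3 1)
```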

Iʼm not sure if this needs fixing, but it was surprising, and the
docstring of `define-coding-system' didnʼt make it clear to me whether
a BOM should have been produced here or not. (Iʼm willing to be told
that buffer-file-coding-system shouldnʼt be 'utf-8-auto, but I never
set that explicitly as far as I know 😀)

Thanks

Robert

In GNU Emacs 29.0.60 (build 14, x86_64-pc-linux-gnu, GTK+ Version
 3.24.24, cairo version 1.16.0) of 2023-01-12 built on rltb
Repository revision: f4f30ff4c44dcfdf780f1981aa541af713f2805f
Repository branch: emacs-29
System Description: Debian GNU/Linux 11 (bullseye)

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 M17N_FLT MODULES NOTIFY
INOTIFY PDUMPER PNG RSVG SECCOMP SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS WEBP X11 XDBE XIM XINPUT2 XPM GTK3 ZLIB




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60750; Package emacs. (Thu, 12 Jan 2023 12:34:01 GMT) Full text and rfc822 format available.

Message #8 received at 60750 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: 60750 <at> debbugs.gnu.org
Subject: Re: bug#60750: 29.0.60;
 encode-coding-char fails for utf-8-auto coding system
Date: Thu, 12 Jan 2023 14:32:52 +0200
> From: Robert Pluim <rpluim <at> gmail.com>
> Date: Thu, 12 Jan 2023 10:08:31 +0100
> 
> 
> src/emacs -Q
> M-x toggle-debug-on-error
> M-: (setq buffer-file-coding-system 'utf-8-auto)
> C-b
> C-u C-x =
> 
> =>
> Debugger entered--Lisp error: (args-out-of-range "))" 3 1)
>   encode-coding-char(41 utf-8-auto ascii)
>   describe-char(189)
>   what-cursor-position((4))
> 
> This is because utf-8-auto has a non-nil :bom property:
> 
> (define-coding-system 'utf-8-auto
>   "UTF-8 (auto-detect signature (BOM))"
>   :coding-type 'utf-8
>   :mnemonic ?U
>   :charset-list '(unicode)
>   :bom '(utf-8-with-signature . utf-8))

Right.  This is a very old bug in encoding with coding systems of the
utf-8 family whose :bom property is a cons cell.  The fix is simple,
but I wonder what it will break out there.  So:

> Iʼm not sure if this needs fixing, but it was surprising, and the
> docstring of `define-coding-system' didnʼt make it clear to me whether
> a BOM should have been produced here or not.

Actually, the doc string is clear:

  If the value is a cons cell, on decoding, check the first two bytes.
  If they are 0xFE 0xFF, use the car part coding system of the value.
  If they are 0xFF 0xFE, use the cdr part coding system of the value.
  Otherwise, treat them as bytes for a normal character.  On encoding,
  produce BOM bytes according to the value of ‘:endian’.

Note the last sentence: it should unconditionally produce the BOM on
encoding.  Which is what we do in your scenario.
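
Whether a given build produces the BOM on encoding can be checked directly.  A minimal sketch, assuming the UTF-8 signature is the byte sequence EF BB BF; `bug60750-bom-prefixed-p' is a hypothetical helper name, not part of Emacs:

```elisp
;; Hypothetical helper (not part of Emacs).
(defun bug60750-bom-prefixed-p (coding-system)
  "Non-nil if \"a\" encoded in CODING-SYSTEM starts with the UTF-8 BOM."
  (string-prefix-p (unibyte-string #xEF #xBB #xBF)
                   (encode-coding-string "a" coding-system)))

(bug60750-bom-prefixed-p 'utf-8)                 ; => nil
(bug60750-bom-prefixed-p 'utf-8-with-signature)  ; => t
;; The result for utf-8-auto depends on whether the build has the fix.
(bug60750-bom-prefixed-p 'utf-8-auto)
```

The first two calls are reference points: a coding system with no signature and one that always writes it.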

> (Iʼm willing to be told that buffer-file-coding-system shouldnʼt be
> 'utf-8-auto, but I never set that explicitly as far as I know 😀)

Who does set utf-8-auto? where did you originally bump into this?
This is an obscure coding-system, and the fix to make it work as
documented will produce an incompatible change in behavior.  So before
I decide whether to make the change and on what branch, I'd like to
know how in the world did you encounter this.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60750; Package emacs. (Thu, 12 Jan 2023 13:45:02 GMT) Full text and rfc822 format available.

Message #11 received at 60750 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 60750 <at> debbugs.gnu.org
Subject: Re: bug#60750: 29.0.60; encode-coding-char fails for utf-8-auto
 coding system
Date: Thu, 12 Jan 2023 14:44:29 +0100
>>>>> On Thu, 12 Jan 2023 14:32:52 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    Eli> Actually, the doc string is clear:

    Eli>   If the value is a cons cell, on decoding, check the first two bytes.
    Eli>   If they are 0xFE 0xFF, use the car part coding system of the value.
    Eli>   If they are 0xFF 0xFE, use the cdr part coding system of the value.
    Eli>   Otherwise, treat them as bytes for a normal character.  On encoding,
    Eli>   produce BOM bytes according to the value of ‘:endian’.

    Eli> Note the last sentence: it should unconditionally produce the BOM on
    Eli> encoding.  Which is what we do in your scenario.

Ah, I misread that as "depending on the value of ':endian'"

One minor nit, the description for ':endian' says:

    `:endian'

    VALUE must be `big' or `little' specifying big-endian and
    little-endian respectively.  The default value is `big'.

    This attribute is meaningful only when `:coding-type' is `utf-16'.

That last sentence seems untrue, as ':endian' is meaningful for
'utf-8-auto'

    >> (Iʼm willing to be told that buffer-file-coding-system shouldnʼt be
    >> 'utf-8-auto, but I never set that explicitly as far as I know 😀)

    Eli> Who does set utf-8-auto? where did you originally bump into this?
    Eli> This is an obscure coding-system, and the fix to make it work as
    Eli> documented will produce an incompatible change in behavior.  So before
    Eli> I decide whether to make the change and on what branch, I'd like to
    Eli> know how in the world did you encounter this.

Itʼs entirely my own fault:

The file where I noticed this is shared between a GNU/Linux and a
macOS machine, which means I foolishly added the following a year ago,
even though itʼs unnecessary (perhaps I was thinking I was going to be
sharing it with a Windows machine?):

    ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

I think that means we can leave the code as it is.

Robert
-- 




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60750; Package emacs. (Thu, 12 Jan 2023 14:05:01 GMT) Full text and rfc822 format available.

Message #14 received at 60750 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: 60750 <at> debbugs.gnu.org
Subject: Re: bug#60750: 29.0.60; encode-coding-char fails for utf-8-auto
 coding system
Date: Thu, 12 Jan 2023 16:04:07 +0200
> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: 60750 <at> debbugs.gnu.org
> Date: Thu, 12 Jan 2023 14:44:29 +0100
> 
> One minor nit, the description for ':endian' says:
> 
>     `:endian'
> 
>     VALUE must be `big' or `little' specifying big-endian and
>     little-endian respectively.  The default value is `big'.
> 
>     This attribute is meaningful only when `:coding-type' is `utf-16'.
> 
> That last sentence seems untrue, as ':endian' is meaningful for
> 'utf-8-auto'

That depends on what you mean by "meaningful".  What it wants to say
is that it's meaningless to change the value of this property for any
coding-system other than UTF-16.
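
The distinction can be illustrated with the signed UTF-16 coding systems, where byte order changes the BOM bytes themselves, whereas the UTF-8 signature is a fixed byte sequence.  A short sketch; it assumes the standard `utf-16be-with-signature' and `utf-16le-with-signature' coding systems are the ones defined with `big' and `little' as their `:endian' values:

```elisp
;; Byte values produced for "a"; `append' turns the unibyte string
;; into a list of integers.
(append (encode-coding-string "a" 'utf-16be-with-signature) nil)
;; => (254 255 0 97), i.e. FE FF followed by big-endian "a"

(append (encode-coding-string "a" 'utf-16le-with-signature) nil)
;; => (255 254 97 0), i.e. FF FE followed by little-endian "a"
```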

>     Eli> Who does set utf-8-auto? where did you originally bump into this?
>     Eli> This is an obscure coding-system, and the fix to make it work as
>     Eli> documented will produce an incompatible change in behavior.  So before
>     Eli> I decide whether to make the change and on what branch, I'd like to
>     Eli> know how in the world did you encounter this.
> 
> Itʼs entirely my own fault:
> 
> The file where I noticed this is shared between a GNU/Linux and a
> macOS machine, which means I foolishly added the following a year ago,
> even though itʼs unnecessary (perhaps I was thinking I was going to be
> sharing it with a Windows machine?):
> 
>     ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

So you thought the "-auto" part was about the EOL format?
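
For reference, the EOL format is carried by the -unix, -dos and -mac variants of a coding system rather than by "-auto".  A brief sketch using `coding-system-eol-type', which returns 0, 1 or 2 for a fixed format, or a vector of variants when the format is to be detected:

```elisp
(coding-system-eol-type 'utf-8-unix)  ; => 0, LF
(coding-system-eol-type 'utf-8-dos)   ; => 1, CRLF
(coding-system-eol-type 'utf-8-mac)   ; => 2, CR

;; A base coding system with no EOL suffix yields a vector of its
;; three variants, meaning the EOL format is detected on decoding.
(coding-system-eol-type 'utf-8)
```

So a cookie of plain `coding: utf-8' should already cover files shared between systems with different line endings.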

> I think that means we can leave the code as it is.

??? "As it is" means this coding-system behaves contrary to
documentation: it should produce BOM on encoding.  Leaving it as is
doesn't sound TRT, so I'd like to have this fixed.  From your
description, it sounds like you bumped into this by mistake, and I see
only one other use of it -- in the test suite.  So I'm inclined to
installing this on the emacs-29 release branch.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#60750; Package emacs. (Thu, 12 Jan 2023 14:29:02 GMT) Full text and rfc822 format available.

Message #17 received at 60750 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 60750 <at> debbugs.gnu.org
Subject: Re: bug#60750: 29.0.60; encode-coding-char fails for utf-8-auto
 coding system
Date: Thu, 12 Jan 2023 15:28:49 +0100
>>>>> On Thu, 12 Jan 2023 16:04:07 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    >> From: Robert Pluim <rpluim <at> gmail.com>
    >> Cc: 60750 <at> debbugs.gnu.org
    >> Date: Thu, 12 Jan 2023 14:44:29 +0100
    >> 
    >> One minor nit, the description for ':endian' says:
    >> 
    >> `:endian'
    >> 
    >> VALUE must be `big' or `little' specifying big-endian and
    >> little-endian respectively.  The default value is `big'.
    >> 
    >> This attribute is meaningful only when `:coding-type' is `utf-16'.
    >> 
    >> That last sentence seems untrue, as ':endian' is meaningful for
    >> 'utf-8-auto'

    Eli> That depends on what you mean by "meaningful".  What it wants to say
    Eli> is that it's meaningless to change the value of this property for any
    Eli> coding-system other than UTF-16.

OK

    Eli> Who does set utf-8-auto? where did you originally bump into this?
    Eli> This is an obscure coding-system, and the fix to make it work as
    Eli> documented will produce an incompatible change in behavior.  So before
    Eli> I decide whether to make the change and on what branch, I'd like to
    Eli> know how in the world did you encounter this.
    >> 
    >> Itʼs entirely my own fault:
    >> 
    >> The file where I noticed this is shared between a GNU/Linux and a
    >> macOS machine, which means I foolishly added the following a year ago,
    >> even though itʼs unnecessary (perhaps I was thinking I was going to be
    >> sharing it with a Windows machine?):
    >> 
    >> ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

    Eli> So you thought the "-auto" part was about the EOL format?

yes. Iʼm having a reading incomprehension day, obviously (just like a
year ago when I made the change originally).

    >> I think that means we can leave the code as it is.

    Eli> ??? "As it is" means this coding-system behaves contrary to
    Eli> documentation: it should produce BOM on encoding.  Leaving it as is
    Eli> doesn't sound TRT, so I'd like to have this fixed.  From your
    Eli> description, it sounds like you bumped into this by mistake, and I see
    Eli> only one other use of it -- in the test suite.  So I'm inclined to
    Eli> installing this on the emacs-29 release branch.

Oh, I thought you were proposing *not* to fix it at all, since itʼs
such an obscure coding system. I have no opinion on where a fix should
go: Iʼm not going to be using that coding system again.

Robert
-- 




Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Thu, 12 Jan 2023 14:40:01 GMT) Full text and rfc822 format available.

Notification sent to Robert Pluim <rpluim <at> gmail.com>:
bug acknowledged by developer. (Thu, 12 Jan 2023 14:40:02 GMT) Full text and rfc822 format available.

Message #22 received at 60750-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: 60750-done <at> debbugs.gnu.org
Subject: Re: bug#60750: 29.0.60; encode-coding-char fails for utf-8-auto
 coding system
Date: Thu, 12 Jan 2023 16:39:07 +0200
> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: 60750 <at> debbugs.gnu.org
> Date: Thu, 12 Jan 2023 15:28:49 +0100
> 
>     >> I think that means we can leave the code as it is.
> 
>     Eli> ??? "As it is" means this coding-system behaves contrary to
>     Eli> documentation: it should produce BOM on encoding.  Leaving it as is
>     Eli> doesn't sound TRT, so I'd like to have this fixed.  From your
>     Eli> description, it sounds like you bumped into this by mistake, and I see
>     Eli> only one other use of it -- in the test suite.  So I'm inclined to
>     Eli> installing this on the emacs-29 release branch.
> 
> Oh, I thought you were proposing *not* to fix it at all, since itʼs
> such an obscure coding system. I have no opinion on where a fix should
> go: Iʼm not going to be using that coding system again.

OK.  So I've installed the fix on the emacs-29 branch, and I'm boldly
closing this bug.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 10 Feb 2023 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 75 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.