GNU bug report logs - #20623
XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save

Package: emacs;

Reported by: Simon Ledergerber <sledergerber <at> gmx.net>

Date: Thu, 21 May 2015 18:53:02 UTC

Severity: normal

Found in version 26.1

Fixed in version 26.2

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20623 in the body.
You can then email your comments to 20623 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Thu, 21 May 2015 18:53:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Simon Ledergerber <sledergerber <at> gmx.net>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 21 May 2015 18:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Simon Ledergerber <sledergerber <at> gmx.net>
To: bug-gnu-emacs <at> gnu.org
Subject: XML and HTML files with encoding/charset="utf-8" declaration loose
 BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200

Hi

When I was editing XHTML and HTML files, I wanted to make sure the BOM 
was written out to the file in order to make it easier for the browser 
to detect the UTF-8 encoding. Therefore I changed the coding system for 
the file buffer to utf-8-with-signature-dos (since I am working on a 
Windows System) before saving the file.

After some time I got surprised because the browser (IE11), didn't 
report UTF-8 as the file's encoding. Having checked the hexdump of my 
(X)HTML file, I saw the BOM was definitely missing.

Obviously, when a "UTF-8" string appears in the <meta charset="utf-8"> 
(even if commented out, see later below) or <?xml version="1.0" 
encoding="utf-8"?> declaration, Emacs switches the file coding system to 
utf-8, when it saves the file, even if utf-8-with-signature was 
specified explicitly before. This appears to me as a bug, because there 
is no way anymore to restore the BOM using Emacs.

I was not sure, if my bug is related to bug #8282, so I decided to 
report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on
Windows 8.1 x64.

I am running Emacs in text-mode only inside a Cygwin console.

This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML the problem can be reproduced in the most basic way as detailed 
out by the following steps:

- Create a new file with C-x C-f in the current directory. Name it 
test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single 
quotes).

- Switch the encoding system to include the BOM: C-x RET f 
utf-8-with-signature-dos.

- Verify the current encoding system with C-h Shift-c RET: Yes, the 
encoding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with 
C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax 
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written 
at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: The encoding system for the buffer 
has changed to utf-8-dos and the BOM has disappeared from the file!

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
    </body>
</html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is 
really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta 
charset="utf-8"> -->

- Save the file and verify: The coding system has changed to utf-8-dos 
and the BOM has vanished, even if it is just a comment and has no effect!
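The hex-dump check used in the steps above can also be done from inside Emacs. The following is only a minimal sketch: the helper name `my/file-has-utf8-bom-p` is hypothetical (it is not an Emacs built-in), and it simply reads the first three bytes of the file without decoding them and compares them with EF BB BF.

```elisp
;; Hypothetical helper, not part of Emacs: the name is an assumption
;; made for this illustration.
(defun my/file-has-utf8-bom-p (file)
  "Return non-nil if FILE begins with the UTF-8 BOM bytes EF BB BF."
  (with-temp-buffer
    ;; Work on raw bytes, so no coding system gets a chance to
    ;; consume or hide the BOM.
    (set-buffer-multibyte nil)
    ;; Read at most the first three bytes of FILE, undecoded.
    (insert-file-contents-literally file nil 0 3)
    (equal (append (buffer-string) nil) '(#xEF #xBB #xBF))))
```

Evaluating `(my/file-has-utf8-bom-p "test.txt")` after each save in the recipe should tell whether the BOM survived, without leaving Emacs.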

Regards

Simon

P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
 of 2015-04-10 on desktop-new
Configured using:
 `configure
 --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
 --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
 --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
 CPPFLAGS= LDFLAGS='

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
 (symbols 48 17091 0)
 (miscs 40 73 387)
 (strings 32 11233 4887)
 (string-bytes 1 291872)
 (vectors 16 7587)
 (vector-slots 8 342125 27930)
 (floats 8 57 393)
 (intervals 56 834 26)
 (buffers 960 21))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Thu, 21 May 2015 19:49:02 GMT) Full text and rfc822 format available.

Message #8 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Simon Ledergerber <sledergerber <at> gmx.net>
Cc: 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 22:48:31 +0300

> Date: Thu, 21 May 2015 20:50:58 +0200
> From: Simon Ledergerber <sledergerber <at> gmx.net>
> 
> When I was editing XHTML and HTML files, I wanted to make sure the BOM 
> was written out to the file in order to make it easier for the browser 
> to detect the UTF-8 encoding. Therefore I changed the coding system for 
> the file buffer to utf-8-with-signature-dos (since I am working on a 
> Windows System) before saving the file.
> 
> After some time I got surprised because the browser (IE11), didn't 
> report UTF-8 as the file's encoding. Having checked the hexdump of my 
> (X)HTML file, I saw the BOM was definitely missing.
> 
> Obviously, when a "UTF-8" string appears in the <meta charset="utf-8"> 
> (even if commented out, see later below) or <?xml version="1.0" 
> encoding="utf-8"?> declaration, Emacs switches the file coding system to 
> utf-8, when it saves the file, even if utf-8-with-signature was 
> specified explicitly before. This appears to me as a bug, because there 
> is no way anymore to restore the BOM using Emacs.

What would you expect Emacs to do instead?  It just obeys the stated
encoding, which says nothing about the BOM.  How can Emacs know when
to use utf-8 and when utf-8-with-signature?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Fri, 22 May 2015 07:12:02 GMT) Full text and rfc822 format available.

Message #11 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Simon Ledergerber <sledergerber <at> gmx.net>
Cc: 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 10:11:31 +0300

[Please don't remove the bug address from the CC list, so that this
discussion is recorded in the bug data base.]

> Date: Thu, 21 May 2015 22:49:47 +0200
> From: Simon Ledergerber <sledergerber <at> gmx.net>
> 
>  From the documentation I understand that utf-8 is without BOM and 
> utf-8-with-signature is with BOM. Maybe I am wrong and should rather 
> understand that utf-8 is auto-detect. But then there is something like 
> utf-8-without-signature missing to specify explicitly that no BOM is 
> desired.
> 
> In my opinion, it is correct when Emacs prefers utf-8 over 
> utf-8-with-signature when it opens a file without BOM that can still be 
> recognized as UTF-8.
> 
> However when a file is opened with a BOM already present, it should 
> stick to the utf-8-with-signature coding system, because the BOM "EF BB 
> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example, 
> there is a different BOM byte pattern. There are other coding systems 
> which do not have a BOM at all.)

What do you mean by "stick to"?  When I try visiting an XML file that
is encoded with BOM, Emacs decodes the file correctly, and the value
of buffer-file-coding-system is utf-8-with-signature.  Isn't that what
you want?  If that's what you want, but it doesn't happen for you,
please try in "emacs -Q".  It's possible that the default you set:

  (setq-default buffer-file-coding-system 'utf-8-dos)

is the reason for what you see.  (I don't understand why you need such
a default, and it sounds like a bad idea to me.)

> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be 
> able to change the coding system.  For example, if I specify utf-8-dos, 
> the BOM should be removed, if one was present, and CR LF should be 
> inserted for EOL. On the other side, if I choose 
> utf-8-with-signature-unix, a BOM should be written and LF be taken for 
> EOL. (The conversion between DOS and Unix works, just the BOM is the 
> problem.)
> 
> I have found a link, where this topic was already discussed, but it 
> didn't help me further:
> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
> 
> In that post Vebjorn Ljosa asked exactly the question I have. Richard 
> Hoskins replies with the answer to change the coding system with C-x 
> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me - 
> after doing a change in the file and saving, it got back to utf-8 
> automatically - that's why I have filed the bug.

That's not how you force a file to be saved in a specific encoding.
You should do this instead:

  C-x RET c utf-8-with-signature RET C-x C-s

The "C-x RET c" prefix forces the next Emacs operation to use the
specified encoding.  In this case, Emacs will ask for confirmation,
because the encoding you specified is different from what the XML
comment says.
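For reference, roughly the same effect can be had from Lisp by binding `coding-system-for-write` around the save. This is a sketch under assumptions: the command name `my/save-buffer-forcing-bom` is hypothetical, and the DOS end-of-line variant is hard-coded only because that is what the reporter uses.

```elisp
;; Hypothetical command; the name is an assumption, not an Emacs built-in.
(defun my/save-buffer-forcing-bom ()
  "Save the current buffer as UTF-8 with a BOM and DOS line endings.
Binding `coding-system-for-write' overrides the coding system Emacs
would otherwise choose, for this one write only."
  (interactive)
  (let ((coding-system-for-write 'utf-8-with-signature-dos))
    ;; Mark the buffer modified so that `save-buffer' really writes it.
    (set-buffer-modified-p t)
    (save-buffer)))
```

As far as I understand the write path, a plain `let` binding like this is not expected to trigger the confirmation prompt that "C-x RET c" produces; treat that as an assumption to verify on your Emacs version.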

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Fri, 22 May 2015 13:22:02 GMT) Full text and rfc822 format available.

Message #14 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Simon Ledergerber <sledergerber <at> gmx.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration
 loose BOM;	Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 15:21:00 +0200

Hello Eli

I have done some more research to answer your questions. You will find 
the details of my statement at the end of this mail.

On 22.05.2015 09:11, Eli Zaretskii wrote:
> [Please don't remove the bug address from the CC list, so that this
> discussion is recorded in the bug data base.]
>
>> Date: Thu, 21 May 2015 22:49:47 +0200
>> From: Simon Ledergerber <sledergerber <at> gmx.net>
>>
>>   From the documentation I understand that utf-8 is without BOM and
>> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
>> understand that utf-8 is auto-detect. But then there is something like
>> utf-8-without-signature missing to specify explicitly that no BOM is
>> desired.
>>
>> In my opinion, it is correct when Emacs prefers utf-8 over
>> utf-8-with-signature when it opens a file without BOM that can still be
>> recognized as UTF-8.
>>
>> However when a file is opened with a BOM already present, it should
>> stick to the utf-8-with-signature coding system, because the BOM "EF BB
>> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
>> there is a different BOM byte pattern. There are other coding systems
>> which do not have a BOM at all.)
> What do you mean by "stick to"?  When I try visiting an XML file that
> is encoded with BOM, Emacs decodes the file correctly, and the value
> of buffer-file-coding-system is utf-8-with-signature.  Isn't that what
> you want?  If that's what you want, but it doesn't happen for you,
> please try in "emacs -Q".  It's possible that the default you set:
>
>    (setq-default buffer-file-coding-system 'utf-8-dos)
>
> is the reason for what you see.  (I don't understand why you need such
> a default, and it sounds like a bad idea to me.)
You're right. When I open a file that was really saved with BOM, Emacs 
detects its encoding correctly, i. e. utf-8-with-signature-dos. But when 
I change the content and save with C-x C-s, the encoding changes to 
utf-8-dos and the BOM gets lost. Even when I start Emacs with -Q. This 
is the actual bug.
>
>> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
>> able to change the coding system.  For example, if I specify utf-8-dos,
>> the BOM should be removed, if one was present, and CR LF should be
>> inserted for EOL. On the other side, if I choose
>> utf-8-with-signature-unix, a BOM should be written and LF be taken for
>> EOL. (The conversion between DOS and Unix works, just the BOM is the
>> problem.)
>>
>> I have found a link, where this topic was already discussed, but it
>> didn't help me further:
>> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>>
>> In that post Vebjorn Ljosa asked exactly the question I have. Richard
>> Hoskins replies with the answer to change the coding system with C-x
>> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
>> after doing a change in the file and saving, it got back to utf-8
>> automatically - that's why I have filed the bug.
> That's not how you force a file to be saved in a specific encoding.
> You should do this instead:
>
>    C-x RET c utf-8-with-signature RET C-x C-s
>
> The "C-x RET c" prefix forces the next Emacs operation to use the
> specified encoding.  In this case, Emacs will ask for confirmation,
> because the encoding you specified is different from what the XML
> comment says.
>
This is true and it worked for me. Please see below for further 
explanations.

Summary:
- C-x RET c utf-8-with-signature RET C-x C-s is a good workaround, 
because it really forces the file being written with BOM. In order to 
have an effect however, the file must be dirty, i. e. there must be a 
pending change. But before the command completes in this case, the 
prompt "Selected encoding utf-8-with-signature-dos disagrees with 
utf-8-dos specified by file contents.  Really save (else edit coding 
cookies and try again)? (yes or no)" appears. I think this is what you 
mean with your sentence: "In this case, Emacs will ask for confirmation, 
because the encoding you specified is different from what the XML 
comment says."

- But consider the following: The encoding in the XML declaration or in 
the HTML <meta charset="utf-8"> just specifies UTF-8 (or another 
encoding). It doesn't say anything about the presence or absence of the 
BOM. Therefore an editor detecting and deciding about the file's 
encoding should not rely on this information only.

- When such a file, which was saved successfully with BOM, is closed and 
reopened again, Emacs detects its encoding correctly, say 
utf-8-with-signature-dos.

- However, when I change the file content and save it again just with 
C-x C-s (without C-x RET c ... first!), then it changes back to 
utf-8-dos. Yes, even if I start emacs with -Q! (That's the point.)

- I do not fully understand the criterion for and the magic behind how 
Emacs chooses the file encoding when I do C-x C-s. But I was able to 
reproduce it several times by applying the procedures given in the bug 
report, even when -Q is on. As we already have stated above, this could 
be because Emacs favors (and forces) utf-8 whenever it sees something 
like XML or HTML that might be UTF-8-encoded.

-> Conclusion: C-x RET c utf-8-with-signature RET C-x C-s is a good way 
to force the file being written as I want. But what I still do not 
understand: When I open a file with BOM and Emacs recognizes that, why 
does it change the encoding silently to drop the BOM when I regularly 
save with C-x C-s - and this even without giving me a notice or warning?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Fri, 22 May 2015 15:23:02 GMT) Full text and rfc822 format available.

Message #17 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Simon Ledergerber <sledergerber <at> gmx.net>, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 11:22:27 -0400

> What would you expect Emacs to do instead?  It just obeys the stated
> encoding, which says nothing about the BOM.  How can Emacs know when
> to use utf-8 and when utf-8-with-signature?

To the extent that Emacs has seen the BOM when opening the file, it
would make sense for Emacs to try and preserve this detail.  IOW the
utf-8 annotation in the XML metadata shouldn't mean "use the utf-8
coding system" but "use a coding system compatible with utf-8".  So if
the coding system is already compatible with utf-8
(e.g. utf-8-with-signature), we should simply keep using that rather
than switch to the utf-8 coding-system.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Fri, 22 May 2015 15:28:02 GMT) Full text and rfc822 format available.

Message #20 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: sledergerber <at> gmx.net, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 18:26:57 +0300

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: Simon Ledergerber <sledergerber <at> gmx.net>,  20623 <at> debbugs.gnu.org
> Date: Fri, 22 May 2015 11:22:27 -0400
> 
> > What would you expect Emacs to do instead?  It just obeys the stated
> > encoding, which says nothing about the BOM.  How can Emacs know when
> > to use utf-8 and when utf-8-with-signature?
> 
> To the extent that Emacs has seen the BOM when opening the file, it
> would make sense for Emacs to try and preserve this detail.

It does.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Fri, 22 May 2015 21:52:02 GMT) Full text and rfc822 format available.

Message #23 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: sledergerber <at> gmx.net, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 22 May 2015 17:51:07 -0400

>> > What would you expect Emacs to do instead?  It just obeys the stated
>> > encoding, which says nothing about the BOM.  How can Emacs know when
>> > to use utf-8 and when utf-8-with-signature?
>> To the extent that Emacs has seen the BOM when opening the file, it
>> would make sense for Emacs to try and preserve this detail.
> It does.

While there are cases where it does, this bug report is about a case
where it doesn't, IIUC.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 23 May 2015 06:45:03 GMT) Full text and rfc822 format available.

Message #26 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: sledergerber <at> gmx.net, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 23 May 2015 09:44:12 +0300

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: sledergerber <at> gmx.net,  20623 <at> debbugs.gnu.org
> Date: Fri, 22 May 2015 17:51:07 -0400
> 
> >> > What would you expect Emacs to do instead?  It just obeys the stated
> >> > encoding, which says nothing about the BOM.  How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
> 
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 23 May 2015 17:12:01 GMT) Full text and rfc822 format available.

Message #29 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Simon Ledergerber <sledergerber <at> gmx.net>
To: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 20623 <at> debbugs.gnu.org
Subject: RE: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 23 May 2015 19:11:15 +0200

As already mentioned in my last post, even when I started Emacs with the option -Q, which should opt out my customizations, it made no difference. So naturally, the source of the problem will be somewhere else.

-----Original Message-----
From: "Eli Zaretskii" <eliz <at> gnu.org>
Sent: 23.05.2015 08:44
To: "Stefan Monnier" <monnier <at> iro.umontreal.ca>
Cc: "sledergerber <at> gmx.net" <sledergerber <at> gmx.net>; "20623 <at> debbugs.gnu.org" <20623 <at> debbugs.gnu.org>
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"	declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: sledergerber <at> gmx.net,  20623 <at> debbugs.gnu.org
> Date: Fri, 22 May 2015 17:51:07 -0400
> 
> >> > What would you expect Emacs to do instead?  It just obeys the stated
> >> > encoding, which says nothing about the BOM.  How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
> 
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 23 May 2015 17:22:02 GMT) Full text and rfc822 format available.

Message #32 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Simon Ledergerber <sledergerber <at> gmx.net>
Cc: monnier <at> iro.umontreal.ca, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 23 May 2015 20:20:56 +0300

> Cc: <20623 <at> debbugs.gnu.org>
> From: Simon Ledergerber <sledergerber <at> gmx.net>
> Date: Sat, 23 May 2015 19:11:15 +0200
> 
> As already mentioned in my last post, even when I started Emacs with the option
> -Q, which should opt out my customizations, it made no difference. So
> naturally, the source of the problem will be somewhere else.

Doesn't happen to me.  So please post the file you used and the exact
sequence of steps, starting from "emacs -Q", to reproduce the problem.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Wed, 12 Oct 2016 21:46:02 GMT) Full text and rfc822 format available.

Message #35 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Alain Schneble <a.s <at> realize.ch>
To: Simon Ledergerber <sledergerber <at> gmx.net>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Wed, 12 Oct 2016 23:44:57 +0200

I'm joining this discussion and would like to report a recipe to
reproduce this issue on Windows:

- emacs -Q
- C-x C-f utf-8-bom-test.xml
- Enter the following text in the new buffer:
<?xml version="1.0" encoding="utf-8"?>
<root></root>
- C-x RET c utf-8-with-signature-dos RET C-x C-s yes RET
- C-x k RET
- C-x C-f utf-8-bom-test.xml
- M-: buffer-file-coding-system
  => utf-8-with-signature-dos
- Change buffer content, e.g. add some text to the root element:
<?xml version="1.0" encoding="utf-8"?>
<root>test</root>
- C-x C-s
- M-: buffer-file-coding-system
  => utf-8-dos
  (expected coding system: utf-8-with-signature-dos)

As it was already mentioned in this thread, just by visiting the file,
then changing and saving the buffer, the BOM gets lost.  This is due to
select-safe-coding-system (called by choose_write_coding_system) fully
trusting the coding system identified by find-auto-coding.  So far so
good.  The latter eventually calls auto-coding-functions which in turn
calls the built-in sgml-xml-auto-coding-function which I think should
take into account some context to enrich the derived coding system with
a signature if needed.  Similar to what select-safe-coding-system does
to enrich the coding with the proper eol-type.

Does that make sense to you?  If so, I'll try to come up with a patch
that enhances sgml-xml-auto-coding-function to take into account
buffer-file-coding-system (buffer + default value) in case it carries
the same text-conversion but different signature.  The proposed "auto
coding" shall inherit the signature in this case.

Thanks for any help.
Alain
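The call chain described above can be probed directly. The sketch below is an illustration only: `my/auto-coding-for-text` is a hypothetical helper, while `find-auto-coding` is the real function named in the message. It puts the XML text into a temporary buffer whose coding system is set to the BOM variant and asks which coding system the declaration yields.

```elisp
;; Hypothetical helper; only `find-auto-coding' is an Emacs function.
(defun my/auto-coding-for-text (text current-coding)
  "Return what `find-auto-coding' derives for TEXT.
CURRENT-CODING is installed as the buffer's coding system first, to
mimic a file that was visited with that encoding."
  (with-temp-buffer
    (insert text)
    (setq buffer-file-coding-system current-coding)
    (goto-char (point-min))
    ;; `find-auto-coding' examines the SIZE characters following point.
    (find-auto-coding "utf-8-bom-test.xml" (buffer-size))))

(my/auto-coding-for-text
 "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<root>test</root>\n"
 'utf-8-with-signature-dos)
```

The result is a cons of a coding system and its source, here `auto-coding-functions`. On an affected Emacs one would expect the coding system to be plain `utf-8`, which is what later replaces the buffer's `utf-8-with-signature-dos` on save.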

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Mon, 04 Dec 2017 16:55:02 GMT) Full text and rfc822 format available.

Message #38 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Alain Schneble <a.s <at> realize.ch>
Cc: Simon Ledergerber <sledergerber <at> gmx.net>, Eli Zaretskii <eliz <at> gnu.org>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Mon, 04 Dec 2017 11:54:03 -0500

Now reported with "fix this or get removed from the distribution"
severity at <https://bugs.debian.org/883434>.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Mon, 04 Dec 2017 17:40:01 GMT) Full text and rfc822 format available.

Message #41 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Glenn Morris <rgm <at> gnu.org>
Cc: Simon Ledergerber <sledergerber <at> gmx.net>, Eli Zaretskii <eliz <at> gnu.org>,
 Alain Schneble <a.s <at> realize.ch>, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Mon, 04 Dec 2017 12:38:57 -0500

> Now reported with "fix this or get removed from the distribution"
> severity at <https://bugs.debian.org/883434>.

I'm curious to see if the OP's "grave" severity settings will stick.
"Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:

    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

The only part that could arguably apply is "causes data loss", but even
that is stretching the meaning of those words, I think.

This said, we should indeed fix this bug.
Not sure how to Do It Right but at least this specific problem should be
fixable with a patch along the lines of the one below (guaranteed 100%
untested).


        Stefan


diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 019e65b2c6..5c0675aa2f 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -1885,6 +1885,12 @@ auto-coding-alist-lookup
 	(setq alist (cdr alist))))
     coding-system))
 
+(defun mule--coding-system-compatible-p (cs new-cs)
+  "Return non-nil if CS is one of the coding-systems described by NEW-CS."
+  (let ((base (coding-system-base cs)))
+    (or (eq base new-cs)
+        (eq base (intern (concat new-cs "-with-signature"))))))
+
 (put 'enable-character-translation 'permanent-local t)
 (put 'enable-character-translation 'safe-local-variable	'booleanp)
 
@@ -2038,8 +2044,12 @@ find-auto-coding
 				(save-excursion
 				  (goto-char (point-min))
 				  (funcall (pop funcs) size)))))
-	(if coding-system
-	    (cons coding-system 'auto-coding-functions)))))
+	(and coding-system
+             ;; Don't override utf-8-with-signature with utf-8
+             ;; or latin-1-mac with latin-1 (bug#20623).
+             (not (mule--coding-system-compatible-p
+                   buffer-file-coding-system coding-system))
+	     (cons coding-system 'auto-coding-functions)))))
 
 (defun set-auto-coding (filename size)
   "Return coding system for a file FILENAME of which SIZE bytes follow point.

Changed bug title to 'XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save' from 'XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save' Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Mon, 04 Dec 2017 17:44:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Mon, 04 Dec 2017 20:29:01 GMT) Full text and rfc822 format available.

Message #46 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Mon, 04 Dec 2017 22:28:20 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: Alain Schneble <a.s <at> realize.ch>,  Simon Ledergerber <sledergerber <at> gmx.net>,  20623 <at> debbugs.gnu.org,  Eli Zaretskii <eliz <at> gnu.org>
> Date: Mon, 04 Dec 2017 12:38:57 -0500
> 
> This said, we should indeed fix this bug.

Agreed.

> Not sure how to Do It Right but at least this specific problem should be
> fixable with a patch along the lines of the one below (guaranteed 100%
> untested).

Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
where the root cause is, AFAIU.

And I don't understand the comment about latin-1-mac: I don't think we
have such problems in Emacs.  The -with-signature variety is
different, because it is not about EOL format.

Thanks.
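
The distinction drawn above between EOL variants and the -with-signature variety can be checked with a minimal sketch like the following. It assumes a stock Emacs; `bom-demo-same-base-p' is a hypothetical helper introduced only for illustration and is not part of either proposed patch.

```elisp
;; Sketch only: `bom-demo-same-base-p' is a hypothetical helper,
;; not an Emacs API and not taken from the patches in this report.
(defun bom-demo-same-base-p (cs1 cs2)
  "Return non-nil if CS1 and CS2 have the same base coding system."
  (eq (coding-system-base cs1) (coding-system-base cs2)))

;; EOL variants are subsidiaries of one base coding system...
(bom-demo-same-base-p 'utf-8-dos 'utf-8-unix)                ; non-nil
;; ...while the signature variety is a base coding system of its own.
(bom-demo-same-base-p 'utf-8-with-signature-dos 'utf-8-dos)  ; nil

;; The EOL part can be changed independently of the signature part.
(coding-system-change-eol-conversion 'utf-8-with-signature 'dos)
```

Comparing bases this way ignores the EOL conversion, which suggests why an EOL suffix and a BOM may need separate treatment when a guessed coding system is compared with the buffer's current one.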

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Mon, 04 Dec 2017 21:09:01 GMT) Full text and rfc822 format available.

Message #49 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Mon, 04 Dec 2017 16:08:14 -0500

> Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
> where the root cause is, AFAIU.

I'd expect the same problem would affect all other uses.

> And I don't understand the comment about latin-1-mac: I don't think we
> have such problems in Emacs.  The -with-signature variety is
> different, because it is not about EOL format.

You might be right, but I don't know where/how this is handled.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 10 Dec 2017 19:18:01 GMT) Full text and rfc822 format available.

Message #52 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sun, 10 Dec 2017 21:17:00 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: rgm <at> gnu.org,  a.s <at> realize.ch,  sledergerber <at> gmx.net,  20623 <at> debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
> 
> > Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
> > where the root cause is, AFAIU.
> 
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses".  Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs.  The -with-signature variety is
> > different, because it is not about EOL format.
> 
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files.  Comments?  Do we want a similar
treatment for UTF-16?  (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway.  But what
about HTML?)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
 	    (let* ((match (match-string 1))
 		   (sym (intern (downcase match))))
 	      (if (coding-system-p sym)
-		  sym
+                  ;; If the encoding tag is UTF-8 and the buffer's
+                  ;; encoding is one of the variants of UTF-8, use the
+                  ;; buffer's encoding.  This allows, e.g., saving an
+                  ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+                  (if (and (coding-system-equal 'utf-8
+                                                (coding-system-type sym))
+                           (coding-system-equal sym
+                                                (coding-system-type
+                                                 buffer-file-coding-system)))
+                      buffer-file-coding-system
+		    sym)
 		(message "Warning: unknown coding system \"%s\"" match)
 		nil))
           ;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
                    (coding-system-base
                     (detect-coding-region (point-min) size t)))))
             ;; Pure ASCII always comes back as undecided.
-            (if (memq detected '(utf-8 undecided))
+            (if (memq detected
+                      '(utf-8 utf-8-with-signature utf-8-hfs undecided))
                 'utf-8
               (warn "File contents detected as %s.
   Consider adding an encoding attribute to the xml declaration,
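
One Lisp detail matters for list literals such as the one passed to memq in the second hunk: the elements of a quoted list are not evaluated, so a nested quote is read as the two-element list (quote SYMBOL) rather than as the symbol. A small illustration, independent of the patch itself:

```elisp
;; Inside a quoted list, 'b is read as the list (quote b), not the symbol b.
(car (cdr '(a 'b)))   ; the list (quote b)

;; memq compares with `eq', so the symbol b is not found here...
(memq 'b '(a 'b))     ; nil
;; ...but it is found when the element is written without the extra quote.
(memq 'b '(a b))      ; (b)
```

For a membership test against coding-system symbols, the list therefore has to contain the bare symbols.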

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Fri, 15 Dec 2017 09:10:02 GMT) Full text and rfc822 format available.

Notification sent to Simon Ledergerber <sledergerber <at> gmx.net>:
bug acknowledged by developer. (Fri, 15 Dec 2017 09:10:03 GMT) Full text and rfc822 format available.

Message #57 received at 20623-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: monnier <at> iro.umontreal.ca
Cc: sledergerber <at> gmx.net, a.s <at> realize.ch, 20623-done <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 15 Dec 2017 11:08:50 +0200

> Date: Sun, 10 Dec 2017 21:17:00 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
> 
> I would like to propose the following alternative patch, which accepts
> utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
> purposes of encoding of XML files.  Comments?  Do we want a similar
> treatment for UTF-16?  (That doesn't seem to be required by the bug
> report, and UTF-16 in XML files is non-standard anyway.  But what
> about HTML?)

No further comments, so I've pushed the change and I'm marking this
bug done.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 12 Jan 2018 12:24:04 GMT) Full text and rfc822 format available.

bug unarchived. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 01 Aug 2018 17:49:01 GMT) Full text and rfc822 format available.

bug Marked as fixed in versions 26.1. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 01 Aug 2018 17:49:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Wed, 01 Aug 2018 18:08:02 GMT) Full text and rfc822 format available.

Message #66 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: sledergerber <at> gmx.net, a.s <at> realize.ch,
 Stefan Monnier <monnier <at> iro.umontreal.ca>, 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration lose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Wed, 01 Aug 2018 14:07:28 -0400

The HTML (not XML) case specified in the original report
("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
https://bugs.debian.org/883434 seems unfixed.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Wed, 01 Aug 2018 18:42:01 GMT) Full text and rfc822 format available.

Message #69 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>
Cc: sledergerber <at> gmx.net, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration lose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Wed, 01 Aug 2018 21:41:15 +0300

> From: Glenn Morris <rgm <at> gnu.org>
> Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>,  20623 <at> debbugs.gnu.org,  a.s <at> realize.ch,  sledergerber <at> gmx.net
> Date: Wed, 01 Aug 2018 14:07:28 -0400
> 
> The HTML (not XML) case specified in the original report
> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
> https://bugs.debian.org/883434 seems unfixed.

Should it be?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Tue, 07 Aug 2018 19:16:02 GMT) Full text and rfc822 format available.

Message #72 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: sledergerber <at> gmx.net, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration lose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Tue, 07 Aug 2018 15:14:58 -0400

Eli Zaretskii wrote:

>> The HTML (not XML) case specified in the original report
>> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
>> https://bugs.debian.org/883434 seems unfixed.
>
> Should it be?

I think this is a bug that should be fixed, yes (if that is the question).

bug Marked as found in versions 26.1; no longer marked as fixed in versions 26.1 and reopened. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 07 Aug 2018 19:16:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Wed, 08 Aug 2018 09:48:01 GMT) Full text and rfc822 format available.

Message #77 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Glenn Morris <rgm <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org>,
 Alain Schneble <a.s <at> realize.ch>, 20623 <at> debbugs.gnu.org,
 Simon Ledergerber <sledergerber <at> gmx.net>
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Wed, 8 Aug 2018 11:47:48 +0200

On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > Now reported with "fix this or get removed from the distribution"
> > severity at <https://bugs.debian.org/883434>.
> 
> I'm curious to see if the OP's "grave" severity settings will stick.
> "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> 
>     makes the package in question unusable or mostly so, or causes data
>     loss, or introduces a security hole allowing access to the accounts
>     of users who use the package.
> 
> The only part that could arguably apply is "causes data loss", but even
> that is stretching the meaning of those words, I think.

Actually there's the issue that the coding system (in Emacs sense)
is changed, but also the fact that this change is invisible to the
user (mainly because the BOM is usually not visible), which makes
the issue even worse. Basically, this is invisible data corruption.
Even though only three bytes are removed, this introduces breakage in
other applications, and it can take the user a long time to find
the cause.

Emacs should not change the coding system when not needed, and when
it needs to, it must make sure to have a confirmation from the user.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Wed, 08 Aug 2018 14:46:02 GMT) Full text and rfc822 format available.
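
What silently disappears from the file in this report can be shown with a minimal sketch, assuming a stock Emacs. `bom-demo-encoded-bytes' is a hypothetical helper used only for illustration.

```elisp
;; Sketch only: `bom-demo-encoded-bytes' is a hypothetical helper,
;; not an Emacs API.
(defun bom-demo-encoded-bytes (text coding-system)
  "Return TEXT encoded with CODING-SYSTEM as a list of byte values."
  (append (encode-coding-string text coding-system) nil))

;; Plain UTF-8: only the byte for "a".
(bom-demo-encoded-bytes "a" 'utf-8)                 ; (97)
;; UTF-8 with signature: the BOM bytes EF BB BF come first.
(bom-demo-encoded-bytes "a" 'utf-8-with-signature)  ; (239 187 191 97)
```

The buffer text is identical in both cases; only the encoded output differs, which is why the change is not visible while editing.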

Message #80 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: Glenn Morris <rgm <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org>,
 Alain Schneble <a.s <at> realize.ch>, 20623 <at> debbugs.gnu.org,
 Simon Ledergerber <sledergerber <at> gmx.net>
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Wed, 08 Aug 2018 10:45:24 -0400

> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user a long time to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.

FWIW, I agree:  I don't think it qualifies as Debian's definition of
"grave", but there is no doubt that it's a bug and that we should
fix it.


        Stefan

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 11 Aug 2018 09:16:01 GMT) Full text and rfc822 format available.

Notification sent to Simon Ledergerber <sledergerber <at> gmx.net>:
bug acknowledged by developer. (Sat, 11 Aug 2018 09:16:02 GMT) Full text and rfc822 format available.

Message #85 received at 20623-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623-done <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 11 Aug 2018 12:15:31 +0300

> Date: Wed, 8 Aug 2018 11:47:48 +0200
> From: Vincent Lefevre <vincent <at> vinc17.net>
> Cc: Glenn Morris <rgm <at> gnu.org>, Simon Ledergerber <sledergerber <at> gmx.net>,
> 	Eli Zaretskii <eliz <at> gnu.org>, Alain Schneble <a.s <at> realize.ch>,
> 	20623 <at> debbugs.gnu.org
> 
> On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > > Now reported with "fix this or get removed from the distribution"
> > > severity at <https://bugs.debian.org/883434>.
> > 
> > I'm curious to see if the OP's "grave" severity settings will stick.
> > "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> > 
> >     makes the package in question unusable or mostly so, or causes data
> >     loss, or introduces a security hole allowing access to the accounts
> >     of users who use the package.
> > 
> > The only part that could arguably apply is "causes data loss", but even
> > that is stretching the meaning of those words, I think.
> 
> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user a long time to find
> the cause.
> 
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.

I agree with the last paragraph, so I've now fixed the remaining issue
of this bug (with HTML files) on the emacs-26 branch.

However, I would respectfully request that in the future bug reports
be accurate and fair in the assigned severity, and in particular make
sure that the severity matches the actual behavior as judged
objectively.

In this case, I cannot but express my extreme surprise to see such a
minor issue described as "grave".  The alleged data loss is minor, if
it exists at all (the BOM is not data important for the user, nor data
whose loss cannot be easily repaired).  The unspecified "breakage in
other applications" cannot be considered without the missing details,
but in general I'd be surprised to hear about modern applications
(browsers?) that really need a BOM in UTF-8 encoded HTML files to the
degree that the lack of BOM causes them to "break" in some way; if
they do, it could arguably be a bug in those applications.

Bottom line: artificially and unreasonably increasing the severity
level doesn't help the motivation to fix the bug, and if anything, has
the opposite effect of ignoring the source of the bug report as not
serious.  I'm sure we don't want that, certainly not for bugs reported
by Debian.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 10:14:02 GMT) Full text and rfc822 format available.

Message #88 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 11 Aug 2018 12:13:41 +0200

On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> In this case, I cannot but express my extreme surprise to see such a
> minor issue described as "grave".  The alleged data loss is minor, if
> it exists at all (the BOM is not data important for the user,

You're completely wrong. The presence of BOM or not is very important
for some applications, such as Firefox (not to determine the charset,
but the MIME type of local files).

> nor data whose loss cannot be easily repaired).

It can be repaired, but the problem is that the user doesn't know
what's going on, and this breaks things. If some package removed
the execute permission of some utility in /bin, this would also
be a grave bug, though it can easily be repaired.

> The unspecified "breakage in
> other applications" cannot be considered without the missing details,
> but in general I'd be surprised to hear about modern applications
> (browsers?) that really need a BOM in UTF-8 encoded HTML files to the
> degree that the lack of BOM causes them to "break" in some way; if
> they do, it could arguably be a bug in those applications.

Firefox. And that's actually the way I detected the bug, after
hours of trying to find why it was behaving in an inconsistent way.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 10:46:02 GMT) Full text and rfc822 format available.

Message #91 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 11 Aug 2018 13:45:17 +0300

> Date: Sat, 11 Aug 2018 12:13:41 +0200
> From: Vincent Lefevre <vincent <at> vinc17.net>
> Cc: monnier <at> iro.umontreal.ca, rgm <at> gnu.org, sledergerber <at> gmx.net,
> 	a.s <at> realize.ch, 20623 <at> debbugs.gnu.org
> 
> On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > In this case, I cannot but express my extreme surprise to see such a
> > minor issue described as "grave".  The alleged data loss is minor, if
> > it exists at all (the BOM is not data important for the user,
> 
> You're completely wrong. The presence of BOM or not is very important
> for some applications, such as Firefox (not to determine the charset,
> but the MIME type of local files).

Please provide the details, including the use case, if possible.  I'm
still in the dark regarding the importance of the BOM in UTF-8 encoded
HTML stuff.

> It can be repaired, but the problem is that the user doesn't know
> what's going on, and this breaks things.

I agree about the user not knowing, but that doesn't yet qualify as
"data loss", which has a widely accepted meaning.

> If some package removed the execute permission of some utility in
> /bin, this would also be a grave bug, though it can easily be
> repaired.

Well, I disagree about the "grave" part, because that means the
package is unusable, causes data loss, or introduces a security hole
allowing access to the user account.  None of that is true in the case
in point.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 12:46:01 GMT) Full text and rfc822 format available.

Message #94 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 08:45:15 -0400

>> > Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
>> > where the root cause is, AFAIU.
>> I'd expect the same problem would affect all other uses.
> Not sure what you meant by "all other uses".  Could you please
> elaborate?

Your commit ec6f588940e51013435408a456c10d33ddf98fb2 answers that
question: at least sgml-html-meta-auto-coding-function is one of those
"other uses".

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs.  The -with-signature variety is
> > different, because it is not about EOL format.
> You might be right, but I don't know where/how this is handled.

I still don't know where the EOL part is handled.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 13:55:02 GMT) Full text and rfc822 format available.

Message #97 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 16:54:04 +0300

> From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
> Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
> Date: Sat, 11 Aug 2018 08:45:15 -0400
> 
> > > And I don't understand the comment about latin-1-mac: I don't think we
> > > have such problems in Emacs.  The -with-signature variety is
> > > different, because it is not about EOL format.
> > You might be right, but I don't know where/how this is handled.
> 
> I still don't know where the EOL part is handled.

If you tell me what you mean by "handled" in this context, I might
be able to help you understand where that happens.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 15:42:01 GMT) Full text and rfc822 format available.

Message #100 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 11 Aug 2018 17:41:01 +0200

On 2018-08-11 13:45:17 +0300, Eli Zaretskii wrote:
> > Date: Sat, 11 Aug 2018 12:13:41 +0200
> > From: Vincent Lefevre <vincent <at> vinc17.net>
> > Cc: monnier <at> iro.umontreal.ca, rgm <at> gnu.org, sledergerber <at> gmx.net,
> > 	a.s <at> realize.ch, 20623 <at> debbugs.gnu.org
> > 
> > On 2018-08-11 12:15:31 +0300, Eli Zaretskii wrote:
> > > In this case, I cannot but express my extreme surprise to see such a
> > > minor issue described as "grave".  The alleged data loss is minor, if
> > > it exists at all (the BOM is not data important for the user,
> > 
> > You're completely wrong. The presence of BOM or not is very important
> > for some applications, such as Firefox (not to determine the charset,
> > but the MIME type of local files).
> 
> Please provide the details, including the use case, if possible.  I'm
> still in the dark regarding the importance of the BOM in UTF-8 encoded
> HTML stuff.

  https://bugzilla.mozilla.org/show_bug.cgi?id=1422889

for HTML. Wontfix because of:

  https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm

For text/plain only (but this is another example that BOM can matter
in practice), there's

  https://bugzilla.mozilla.org/show_bug.cgi?id=1071816

(which is a bug that should be fixed).

> > It can be repaired, but the problem is that the user doesn't know
> > what's going on, and this breaks things.
> 
> I agree about the user not knowing, but that doesn't yet qualify as
> "data loss", which has a widely accepted meaning.

This is data corruption, which is a form of data loss, because some
information is lost in the process (recall that Emacs does not
provide any information to the user about this transformation).

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sat, 11 Aug 2018 16:28:02 GMT) Full text and rfc822 format available.

Message #103 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sat, 11 Aug 2018 19:27:33 +0300

> Date: Sat, 11 Aug 2018 17:41:01 +0200
> From: Vincent Lefevre <vincent <at> vinc17.net>
> Cc: monnier <at> iro.umontreal.ca, rgm <at> gnu.org, sledergerber <at> gmx.net,
> 	a.s <at> realize.ch, 20623 <at> debbugs.gnu.org
> 
> > > You're completely wrong. The presence of BOM or not is very important
> > > for some applications, such as Firefox (not to determine the charset,
> > > but the MIME type of local files).
> > 
> > Please provide the details, including the use case, if possible.  I'm
> > still in the dark regarding the importance of the BOM in UTF-8 encoded
> > HTML stuff.
> 
>   https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
> 
> for HTML. Wontfix because of:
> 
>   https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
> 
> For text/plain only (but this is another example that BOM can matter
> in practice), there's
> 
>   https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
> 
> (which is a bug that should be fixed).

Maybe I'm missing something, but none of these issues describes the
situation in this bug report, namely: an HTML file with an explicit
charset= tag, with or without a BOM.  In fact, the first of these
issues happens only in files that _do_ have a BOM, so you could say
that Emacs did you a favor by removing it ;-)

> > I agree about the user not knowing, but that doesn't yet qualify as
> > "data loss", which has a widely accepted meaning.
> 
> This is data corruption, which is a form of data loss, because some
> information is lost in the process (recall that Emacs does not
> provide any information to the user about this transformation).

That is the most inclusive interpretation of "data loss" I've ever
seen.  "Some information is lost" is nowhere near what "grave bug"
means by "data loss", so I don't think "grave" applies here.

Anyway, the Emacs issue is now fixed.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 12 Aug 2018 00:05:02 GMT) Full text and rfc822 format available.

Message #106 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 20:04:05 -0400

>> > > And I don't understand the comment about latin-1-mac: I don't think we
>> > > have such problems in Emacs.  The -with-signature variety is
>> > > different, because it is not about EOL format.
>> > You might be right, but I don't know where/how this is handled.
>> I still don't know where the EOL part is handled.
> If you tell me what you mean by "handled" in this context, I might
> be able to help you understand where that happens.

You say that the code I wrote is not needed to make sure an existing
latin-1-mac setting isn't overwritten by a latin-1 guess.  I expect this
is indeed true (otherwise I think we'd have had bug-reports about it),
but I don't know where that is handled.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 12 Aug 2018 00:12:02 GMT) Full text and rfc822 format available.

Message #109 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: rgm <at> gnu.org, Eli Zaretskii <eliz <at> gnu.org>, a.s <at> realize.ch,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 20:11:49 -0400

>> > > In this case, I cannot but express my extreme surprise to see such a
>> > > minor issue described as "grave".  The alleged data loss is minor, if
>> > > it exists at all (the BOM is not data important for the user,
>> > You're completely wrong. The presence of BOM or not is very important
>> > for some applications, such as Firefox (not to determine the charset,
>> > but the MIME type of local files).
>> Please provide the details, including the use case, if possible.  I'm
>> still in the dark regarding the importance of the BOM in UTF-8 encoded
>> HTML stuff.
>   https://bugzilla.mozilla.org/show_bug.cgi?id=1422889

I don't see any data loss there.


        Stefan


PS: We can all cook up contrived scenarios where this bug leads to a serious
loss of data.  But in that case a problem in C-n which makes it move to
the wrong column would also qualify as "grave", because I can just as
well construct a contrived scenario where such a bug leads to a serious
loss of data.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 12 Aug 2018 00:59:01 GMT) Full text and rfc822 format available.

Message #112 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: rgm <at> gnu.org, Eli Zaretskii <eliz <at> gnu.org>, a.s <at> realize.ch,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sun, 12 Aug 2018 02:58:53 +0200

On 2018-08-11 20:11:49 -0400, Stefan Monnier wrote:
> >> Please provide the details, including the use case, if possible.  I'm
> >> still in the dark regarding the importance of the BOM in UTF-8 encoded
> >> HTML stuff.
> >   https://bugzilla.mozilla.org/show_bug.cgi?id=1422889
> 
> I don't see any data loss there.

Because it is not there, it is in Emacs. What the Mozilla bug shows
is that the presence of BOM or not is important and yields very
different behavior.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 12 Aug 2018 01:35:02 GMT) Full text and rfc822 format available.

Message #115 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, monnier <at> iro.umontreal.ca,
 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8"
 declaration loose BOM; Coding system is reset from utf-8-with-signature to
 utf-8 on save
Date: Sun, 12 Aug 2018 03:34:25 +0200

On 2018-08-11 19:27:33 +0300, Eli Zaretskii wrote:
> Maybe I'm missing something, but none of these issues describes the
> situation in this bug report, namely: an HTML file with an explicit
> charset= tag, with or without a BOM.  In fact, the first of these
> issues happens only in files that _do_ have a BOM, so you could say
> that Emacs did you a favor by removing it ;-)

In theory yes, but in practice, one does not want that when doing
file-loading tests. Otherwise the tests become meaningless. This
is just like a spellchecker that automatically corrects spelling
mistakes without the user's knowledge (even when it is right): if
the goal is to write something about a spelling mistake, the
text becomes meaningless. The same goes for characters that are
changed automatically to improve typography (as can be seen in some
blog software when posting, with no preview), since this can make
the text meaningless, e.g. when it is code.

> Anyway, the Emacs issue is now fixed.

OK, thanks.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20623; Package emacs. (Sun, 12 Aug 2018 19:09:02 GMT) Full text and rfc822 format available.

Message #118 received at 20623 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8"	declaration loose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sun, 12 Aug 2018 22:07:57 +0300

> From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
> Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
> Date: Sat, 11 Aug 2018 20:04:05 -0400
> 
> You say that the code I wrote is not needed to make sure an existing
> latin-1-mac setting isn't overwritten by a latin-1 guess.  I expect this
> is indeed true (otherwise I think we'd have had bug-reports about it),
> but I don't know where that is handled.

It is handled inside select-safe-coding-system, which first invokes
find-auto-coding to decide which encoding is appropriate (and as part
of that, looks at XML or HTML charset information declared by the
text), and then, if the encoding it got doesn't specify the EOL
conversion, it uses the EOL conversion from the buffer's encoding or
from the appropriate defaults.

Since XML/HTML charset tags never specify the EOL conversion, it
follows that Emacs will never override the EOL conversion of the
buffer; it will only use the charset for "text conversion".

I hope this answers your question.
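
For readers following along, the merge rule described above can be
sketched as a toy model.  This is not Emacs code and does not call
Emacs: the helper names split_coding and merge_eol are hypothetical,
coding systems are treated as plain strings with an optional
-unix/-dos/-mac suffix, and the "unix" fallback merely stands in for
"the appropriate defaults".

```python
from typing import Optional, Tuple

# EOL suffixes as they appear in coding-system names such as latin-1-mac.
EOL_SUFFIXES = ("unix", "dos", "mac")


def split_coding(name: str) -> Tuple[str, Optional[str]]:
    """Split a coding-system-like name into (text part, EOL part or None)."""
    base, sep, tail = name.rpartition("-")
    if sep and tail in EOL_SUFFIXES:
        return base, tail
    return name, None


def merge_eol(declared: str, buffer_coding: str, default_eol: str = "unix") -> str:
    """Toy model: keep the declared text conversion, fill in a missing EOL.

    `declared` plays the role of the charset found in the XML/HTML text,
    `buffer_coding` the buffer's current encoding.  `default_eol` is an
    assumption made only for this sketch.
    """
    text, eol = split_coding(declared)
    if eol is None:
        # The declaration says nothing about EOL, so reuse the buffer's.
        _, buffer_eol = split_coding(buffer_coding)
        eol = buffer_eol or default_eol
    return f"{text}-{eol}"


if __name__ == "__main__":
    # A charset="utf-8" declaration in a buffer currently using latin-1-mac.
    print(merge_eol("utf-8", "latin-1-mac"))
```

In this model a declared charset never carries an EOL suffix, so the
buffer's EOL conversion always survives; only the text part changes.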

bug Marked as fixed in versions 26.2. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 21 Aug 2018 16:56:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 19 Sep 2018 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 279 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.