GNU bug report logs - #15803
default-file-name-coding-system: utf-8 better than latin-1 these days?


Package: emacs

Reported by: Glenn Morris <rgm <at> gnu.org>

Date: Mon, 4 Nov 2013 18:46:01 UTC

Severity: normal

Tags: fixed

Found in version 24.3

Fixed in version 28.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15803 in the body.
You can then email your comments to 15803 AT debbugs.gnu.org in the normal way.



Report forwarded to handa <at> gnu.org, bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Mon, 04 Nov 2013 18:46:01 GMT) Full text and rfc822 format available.

Message #3 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: submit <at> debbugs.gnu.org
Subject: default-file-name-coding-system: utf-8 better than latin-1 these days?
Date: Mon, 04 Nov 2013 13:45:32 -0500

Package: emacs
Version: 24.3

Split from http://debbugs.gnu.org/15260

Eli Zaretskii wrote:

> mule-cmds.el calls reset-language-environment, and language/english.el
> calls set-language-info-alist; both have the effect of resetting
> default-file-name-coding-system to latin-1 (!? an interesting
> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
> we still do that).

I know nothing about this, but eg glib defaults to utf-8, which seems
like a better default to me these days:

https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 01 Dec 2017 01:53:02 GMT) Full text and rfc822 format available.

Message #6 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Thu, 30 Nov 2017 20:52:17 -0500

Glenn Morris wrote:

>> mule-cmds.el calls reset-language-environment, and language/english.el
>> calls set-language-info-alist; both have the effect of resetting
>> default-file-name-coding-system to latin-1 (!? an interesting
>> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
>> we still do that).
>
> I know nothing about this, but eg glib defaults to utf-8, which seems
> like a better default to me these days:
>
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

... 4 years pass and latin-1 fails to make a comeback.

For some reason, I thought it was difficult to change the default to
utf-8 due to bootstrap ordering issues. This was probably prompted by
this comment in reset-language-environment:

  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
  ;; that is not yet defined, so we set it in set-locale-environment instead.
  (setq default-file-name-coding-system 'iso-latin-1-unix)

But looking at it now, I cannot see what this comment is referring to.

If I change reset-language-environment so that it sets
default-file-name-coding-system (and default-sendmail-coding-system)
to 'utf-8, then a bootstrap works fine.

It looks like this stuff was all rewritten in Emacs 23.
Before that, there used to be international/utf-8.el,
which was indeed loaded after mule-cmds.
But since Emacs 23, mule-conf seems to define everything.
(But that rewrite seems to predate the above comment about Darwin...?)

So should the default finally be changed to utf-8?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 01 Dec 2017 07:56:02 GMT) Full text and rfc822 format available.

Message #9 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>
Cc: 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 01 Dec 2017 09:54:36 +0200

> From: Glenn Morris <rgm <at> gnu.org>
> Date: Thu, 30 Nov 2017 20:52:17 -0500
> 
> So should the default finally be changed to utf-8?

Perhaps on Posix systems, but not elsewhere.  And if we make the
change, we should make sure building Emacs in a non-ASCII directory
still works.

Btw, why does the default matter so much?  Once Emacs starts up
default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
locale says so.  Is this just an aesthetic issue?
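
One way to see which value a given locale actually produces is to ask a
batch Emacs directly.  This is only a minimal sketch, assuming a POSIX
shell and an `emacs` binary on PATH; the variable name `report` is
illustrative, and the output depends on the locale in effect.

```shell
# Print the file-name coding variables as Emacs sets them at startup
# for the current locale.  Falls back to a message if emacs is absent.
if command -v emacs >/dev/null 2>&1; then
  report=$(emacs -Q --batch --eval \
    '(princ (format "%s %s" file-name-coding-system default-file-name-coding-system))' \
    2>/dev/null)
  # Guard against an emacs that ran but printed nothing.
  report=${report:-"emacs produced no output"}
else
  report="emacs not found on PATH"
fi
echo "$report"
```

Running it once under a UTF-8 locale and once under a Latin-1 locale
should show whether the compiled-in default ever survives startup.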

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Tue, 05 Dec 2017 00:36:01 GMT) Full text and rfc822 format available.

Message #12 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Mon, 04 Dec 2017 19:35:05 -0500

Eli Zaretskii wrote:

> Perhaps on Posix systems, but not elsewhere. 

I assume non-POSIX is newspeak for MS-Windows (native and DOS).

> And if we make the change, we should make sure building Emacs in a
> non-ASCII directory still works.

It works fine for me on G/L to have source, build, and install
directories be distinct non-ASCII directories. (Emacs works, that is,
but makeinfo 5.1 fails to find @include files in non-ASCII directories,
so I wonder how common such setups are.)


BTW, it feels very dated to me to have discussion of Windows 9X in the
Emacs manual section on file-name-coding.


diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
index 78f77cb..5fc44a6 100644
--- i/doc/emacs/mule.texi
+++ w/doc/emacs/mule.texi
@@ -1214,11 +1214,8 @@ system can encode.
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
index 9d22d6e..192f0e9 100644
--- i/lisp/international/mule-cmds.el
+++ w/lisp/international/mule-cmds.el
@@ -1797,10 +1797,11 @@ The default status is as follows:
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1816,8 +1817,10 @@ The default status is as follows:
 	(input-coding
 	 (condition-case nil
 	     (coding-system-change-text-conversion
-	      (cdr default-process-coding-system) 'iso-latin-1)
-	   (coding-system-error 'iso-latin-1))))
+	      (cdr default-process-coding-system)
+	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+	   (coding-system-error
+	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
 	  (cons output-coding input-coding)))
 
diff --git i/lisp/mail/sendmail.el w/lisp/mail/sendmail.el
index cd80211..36fbb7d 100644
--- i/lisp/mail/sendmail.el
+++ w/lisp/mail/sendmail.el
@@ -993,7 +993,7 @@ but lower priority than the local value of `buffer-file-coding-system'.
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
index 98067ce..25118cd 100644
--- i/lisp/mh-e/mh-comp.el
+++ w/lisp/mh-e/mh-comp.el
@@ -304,6 +304,7 @@ message and scan line."
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (and (local-variable-p 'buffer-file-coding-system
                                     (current-buffer)) ;XEmacs needs two args
@@ -315,7 +316,7 @@ message and scan line."
            (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                (and (default-boundp 'buffer-file-coding-system)
                     (default-value 'buffer-file-coding-system))
-               'iso-latin-1))))
+               'utf-8))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 08 Dec 2017 09:48:01 GMT) Full text and rfc822 format available.

Message #15 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>
Cc: 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 08 Dec 2017 11:46:29 +0200

> From: Glenn Morris <rgm <at> gnu.org>
> Cc: 15803 <at> debbugs.gnu.org
> Date: Mon, 04 Dec 2017 19:35:05 -0500
> 
> Eli Zaretskii wrote:
> 
> > Perhaps on Posix systems, but not elsewhere. 
> 
> I assume non-POSIX is newspeak for MS-Windows (native and DOS).

I didn't say "non-Posix"; you did.

MS-Windows is definitely not a Posix system, but whether it is the
only one, I don't know.  Are we sure all macOS/Darwin systems are
sufficiently Posix in this aspect?  AFAIR they use quite different
encoding methods for file names (canonical normalization etc.).

> > And if we make the change, we should make sure building Emacs in a
> > non-ASCII directory still works.
> 
> It works fine for me on G/L to have source, build, and install
> directories be distinct non-ASCII directories.

Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
potentially problematic case, AFAIR.

> (Emacs works, that is,
> but makeinfo 5.1 fails to find @include files in non-ASCII directories,
> so I wonder how common such setups are.)

Building a release tarball doesn't require makeinfo.

> BTW, it feels very dated to me to have discussion of Windows 9X in the
> Emacs manual section on file-name-coding.

We still try to support it, and the aspects of file-name encoding
related to it are definitely non-trivial.  Everything described there
is in the code.

> diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
> index 78f77cb..5fc44a6 100644
> --- i/doc/emacs/mule.texi
> +++ w/doc/emacs/mule.texi
> @@ -1214,11 +1214,8 @@ system can encode.
>  
>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>  default coding system determined by the selected language environment,
> -and stored in the @code{default-file-name-coding-system} variable.
> -@c FIXME?  Is this correct?  What is the "default language environment"?
> -In the default language environment, non-@acronym{ASCII} characters in
> -file names are not encoded specially; they appear in the file system
> -using the internal Emacs representation.
> +and stored in the @code{default-file-name-coding-system} variable
> +(normally UTF-8).

Not sure why you removed the sentence which had the FIXME comment.  Is
it in any way related to the issue at hand?

>  @cindex file-name encoding, MS-Windows
>  @vindex w32-unicode-filenames
> diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
> index 9d22d6e..192f0e9 100644
> --- i/lisp/international/mule-cmds.el
> +++ w/lisp/international/mule-cmds.el
> @@ -1797,10 +1797,11 @@ The default status is as follows:
>     'raw-text)
>  
>    (set-default-coding-systems nil)
> -  (setq default-sendmail-coding-system 'iso-latin-1)
> -  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
> -  ;; that is not yet defined, so we set it in set-locale-environment instead.
> -  (setq default-file-name-coding-system 'iso-latin-1-unix)
> +  (setq default-sendmail-coding-system 'utf-8)
> +  (setq default-file-name-coding-system (if (memq system-type
> +                                                  '(windows-nt ms-dos))
> +                                            'iso-latin-1-unix
> +                                          'utf-8-unix))

Why are we changing sendmail-coding-system?  It has nothing to do with
file names, AFAIK.

>    ;; Preserve eol-type from existing default-process-coding-systems.
>    ;; On non-unix-like systems in particular, these may have been set
>    ;; carefully by the user, or by the startup code, to deal with the
> @@ -1816,8 +1817,10 @@ The default status is as follows:
>  	(input-coding
>  	 (condition-case nil
>  	     (coding-system-change-text-conversion
> -	      (cdr default-process-coding-system) 'iso-latin-1)
> -	   (coding-system-error 'iso-latin-1))))
> +	      (cdr default-process-coding-system)
> +	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
> +	   (coding-system-error
> +	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
>      (setq default-process-coding-system
>  	  (cons output-coding input-coding)))

And this changes the default encoding used to communicate with
sub-processes.  Why?  We never talked about a wholesale change of all
the defaults to UTF-8; that is a much broader issue than just the
encoding of file names.

> diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
> index 98067ce..25118cd 100644
> --- i/lisp/mh-e/mh-comp.el
> +++ w/lisp/mh-e/mh-comp.el
> @@ -304,6 +304,7 @@ message and scan line."
>    (let ((draft-buffer (current-buffer))
>          (file-name buffer-file-name)
>          (config mh-previous-window-config)
> +        ;; FIXME this is subtly different to select-message-coding-system.
>          (coding-system-for-write
>           (if (and (local-variable-p 'buffer-file-coding-system
>                                      (current-buffer)) ;XEmacs needs two args
> @@ -315,7 +316,7 @@ message and scan line."
>             (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
>                 (and (default-boundp 'buffer-file-coding-system)
>                      (default-value 'buffer-file-coding-system))
> -               'iso-latin-1))))
> +               'utf-8))))

Changes like that in MH-E should be communicated to the MH-E
developer; I'm not sure he is reading this list.

And you never answered my question about the rationale:

> Btw, why does the default matter so much?  Once Emacs starts up
> default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
> locale says so.  Is this just an aesthetic issue?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Tue, 12 Dec 2017 01:39:02 GMT) Full text and rfc822 format available.

Message #18 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Mon, 11 Dec 2017 20:38:15 -0500

Eli Zaretskii wrote:

> Are we sure all macOS/Darwin systems are sufficiently Posix in this
> aspect?

Emacs on Darwin has been unconditionally using utf-8 for over a decade.
It's special-cased in mule-cmds, as visible in the diff I sent.

>> It works fine for me on G/L to have source, build, and install
>> directories be distinct non-ASCII directories.
>
> Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
> potentially problematic case, AFAIR.

I had LANG=en_US.UTF-8. I've repeated with LANG=en_US. Still works.

>>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>>  default coding system determined by the selected language environment,
>> -and stored in the @code{default-file-name-coding-system} variable.
>> -@c FIXME?  Is this correct?  What is the "default language environment"?
>> -In the default language environment, non-@acronym{ASCII} characters in
>> -file names are not encoded specially; they appear in the file system
>> -using the internal Emacs representation.
>> +and stored in the @code{default-file-name-coding-system} variable
>> +(normally UTF-8).
>
> Not sure why you removed the sentence which had the FIXME comment.  Is
> it in any way related to the issue at hand?

I wrote the FIXME comment. In 5 years, no-one has addressed it.
Defaulting to UTF-8 makes it no longer relevant, so it seems better to
remove it.

> Why are we changing sendmail-coding-system?  It has nothing to do with
> file names, AFAIK.

I'm changing all (3) things that currently default to latin-1 to default to
utf-8.

>> Btw, why does the default matter so much?  Once Emacs starts up
>> default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
>> locale says so.  Is this just an aesthetic issue?

utf-8 is the sensible, "modern" (ie, non-ancient) default.
If there is no reason to use latin-1, Emacs should use utf-8.
I'm not claiming it's critical.

Take it or leave it, as you wish.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Wed, 09 Sep 2020 13:16:01 GMT) Full text and rfc822 format available.

Message #21 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Glenn Morris <rgm <at> gnu.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Wed, 09 Sep 2020 15:15:09 +0200

Glenn Morris <rgm <at> gnu.org> writes:

> utf-8 is the sensible, "modern" (ie, non-ancient) default.
> If there is no reason to use latin-1, Emacs should use utf-8.
> I'm not claiming it's critical.
>
> Take it or leave it, as you wish.

That was the final message in the thread.  Glenn's patch from almost three
years ago no longer applied, so I've respun it for Emacs 28 now (included
below).

Glenn's arguments make sense to me, but I'm not a domain expert here.
Does anybody object to applying this patch to Emacs 28?

diff --git a/doc/emacs/mule.texi b/doc/emacs/mule.texi
index 6eff0ca0d2..b78019020a 100644
--- a/doc/emacs/mule.texi
+++ b/doc/emacs/mule.texi
@@ -1215,11 +1215,8 @@ File Name Coding
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index ccc8ac9f9e..e3155dfc52 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -1799,13 +1799,11 @@ reset-language-environment
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  ;; [Actually, it seems to work fine to use utf-8-unix here, and not just
-  ;; on Darwin.  The previous comment seems to be outdated?
-  ;; See patch at https://debbugs.gnu.org/15803 ]
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1821,8 +1819,10 @@ reset-language-environment
 	(input-coding
 	 (condition-case nil
 	     (coding-system-change-text-conversion
-	      (cdr default-process-coding-system) 'iso-latin-1)
-	   (coding-system-error 'iso-latin-1))))
+	      (cdr default-process-coding-system)
+	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+	   (coding-system-error
+	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
 	  (cons output-coding input-coding)))
 
diff --git a/lisp/mail/sendmail.el b/lisp/mail/sendmail.el
index dd6eecbfd0..7610939e57 100644
--- a/lisp/mail/sendmail.el
+++ b/lisp/mail/sendmail.el
@@ -975,7 +975,7 @@ sendmail-coding-system
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git a/lisp/mh-e/mh-comp.el b/lisp/mh-e/mh-comp.el
index f7e30bfbb3..8a69adbb75 100644
--- a/lisp/mh-e/mh-comp.el
+++ b/lisp/mh-e/mh-comp.el
@@ -305,6 +305,7 @@ mh-send-letter
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (fboundp 'select-message-coding-system)
              (select-message-coding-system) ; Emacs has this since at least 21.1
@@ -318,7 +319,7 @@ mh-send-letter
              (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                  (and (default-boundp 'buffer-file-coding-system)
                       (default-value 'buffer-file-coding-system))
-                 'iso-latin-1)))))
+                 'utf-8)))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Wed, 09 Sep 2020 13:34:02 GMT) Full text and rfc822 format available.

Message #24 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefan <at> marxist.se>
To: Glenn Morris <rgm <at> gnu.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Wed, 9 Sep 2020 06:33:11 -0700

Glenn Morris <rgm <at> gnu.org> writes:

> BTW, it feels very dated to me to have discussion of Windows 9X in the
> Emacs manual section on file-name-coding.

Agreed.  Could we move this discussion to the MS Windows FAQ instead?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Wed, 09 Sep 2020 15:01:01 GMT) Full text and rfc822 format available.

Message #27 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Wed, 09 Sep 2020 18:00:28 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  15803 <at> debbugs.gnu.org
> Date: Wed, 09 Sep 2020 15:15:09 +0200
> 
> Glenn's arguments make sense to me, but I'm not a domain expert here.
> Does anybody object to applying this patch to Emacs 28?

Please try building Emacs from a pristine tarball or a clean
repository in a directory with non-ASCII characters, under a
non-UTF-8, non-C locale.  If that works, I think this is good to go.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Wed, 09 Sep 2020 15:10:02 GMT) Full text and rfc822 format available.

Message #30 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefan <at> marxist.se>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Wed, 09 Sep 2020 18:09:03 +0300

> From: Stefan Kangas <stefan <at> marxist.se>
> Date: Wed, 9 Sep 2020 06:33:11 -0700
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 15803 <at> debbugs.gnu.org
> 
> Glenn Morris <rgm <at> gnu.org> writes:
> 
> > BTW, it feels very dated to me to have discussion of Windows 9X in the
> > Emacs manual section on file-name-coding.
> 
> Agreed.  Could we move this discussion to the MS Windows FAQ instead?

I don't think the FAQ is the right place for this information.  So no,
please don't move it to the FAQ.

But we could move this to the MS-Windows appendix, leaving a
cross-reference where the text is now.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Thu, 10 Sep 2020 13:08:01 GMT) Full text and rfc822 format available.

Message #33 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Thu, 10 Sep 2020 15:07:12 +0200


Eli Zaretskii <eliz <at> gnu.org> writes:

> Please try building Emacs from a pristine tarball or a clean
> repository in a directory with non-ASCII characters, under a
> non-UTF-8, non-C locale.  If that works, I think this is good to go.

All the tools under Linux are so utf-8-focused these days...  let's
see...  I first, under a utf-8 locale created the directory "émacs",
then converted it to 8859-1:

[larsi <at> stories ~/src/emacs]$ convmv --notest -f UTF-8 -t ISO-8859-1 émacs 
mv "./émacs"	"./�macs"

Which ls displays, funnily enough, as:

-rw-r--r--  1 larsi larsi    0 Sep 10 14:50 ''$'\351''macs'

Then I did

export LANG=sv_SE.ISO-8859-1

and now the ls says the file is:

[Message part 2 (image/png, inline)]


And then I build Emacs there, and it seems to work fine.  Then I apply
the patch and say "make":

Loading /home/larsi/src/emacs/�*macs/lisp/subdirs.el (source)...
>>Error occurred processing ../lisp/international/mule-cmds.el: File is missing (("Opening input file" "No such file or directory" "/home/larsi/src/emacs/�*macs/lisp/international/mule-cmds.el"))
make[2]: *** [Makefile:279: ../lisp/international/mule-cmds.elc] Error 1
make[1]: *** [Makefile:784: ../lisp/international/mule-cmds.elc] Error 2
make[1]: Leaving directory '/home/larsi/src/emacs/�*macs/src'

So that fails pretty much immediately...

OK, let's try a make bootstrap...

And now building Emacs works fine.  So it seems like a make bootstrap is
necessary after applying the patch.

And starting Emacs works fine.

But "make check" fails miserably:

make[3]: *** [Makefile:165: src/eval-tests.elc] Error 1
  ELC      src/font-tests.elc
>>Error occurred processing src/fileio-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/\301\203*macs/test/src/fileio-tests.elc7HRcu0"))

So...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Thu, 10 Sep 2020 14:40:01 GMT) Full text and rfc822 format available.

Message #36 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Thu, 10 Sep 2020 17:39:37 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Thu, 10 Sep 2020 15:07:12 +0200
> 
> > Please try building Emacs from a pristine tarball or a clean
> > repository in a directory with non-ASCII characters, under a
> > non-UTF-8, non-C locale.  If that works, I think this is good to go.
> 
> All the tools under Linux are so utf-8-focused these days...  let's
> see...  I first, under a utf-8 locale created the directory "émacs",
> then converted it to 8859-1:

No, please create the directory with non-ASCII name _after_ switching
the locale to Latin-1.

> And then I build Emacs there, and it seems to work fine.  Then I apply
> the patch and say "make":
> 
> Loading /home/larsi/src/emacs/�*macs/lisp/subdirs.el (source)...
> >>Error occurred processing ../lisp/international/mule-cmds.el: File is missing (("Opening input file" "No such file or directory" "/home/larsi/src/emacs/�*macs/lisp/international/mule-cmds.el"))
> make[2]: *** [Makefile:279: ../lisp/international/mule-cmds.elc] Error 1
> make[1]: *** [Makefile:784: ../lisp/international/mule-cmds.elc] Error 2
> make[1]: Leaving directory '/home/larsi/src/emacs/�*macs/src'
> 
> So that fails pretty much immediately...
> 
> OK, let's try a make bootstrap...
> 
> And now building Emacs works fine.  So it seems like a make bootstrap is
> necessary after applying the patch.
> 
> And starting Emacs works fine.
> 
> But "make check" fails miserably:
> 
> make[3]: *** [Makefile:165: src/eval-tests.elc] Error 1
>   ELC      src/font-tests.elc
> >>Error occurred processing src/fileio-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/\301\203*macs/test/src/fileio-tests.elc7HRcu0"))
> 
> So...

This all happens because the directory name doesn't correspond to the
locale.  You need to create the directory in the 8859-1 locale.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 10:57:02 GMT) Full text and rfc822 format available.

Message #39 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 12:55:55 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> All the tools under Linux are so utf-8-focused these days...  let's
>> see...  I first, under a utf-8 locale created the directory "émacs",
>> then converted it to 8859-1:
>
> No, please create the directory with non-ASCII name _after_ switching
> the locale to Latin-1.

Shouldn't the result be the same?  I.e., a name with iso-8859-1 name?
The reason I did it in this convoluted way was just that I couldn't
convince my system to make an 8859 name even after changing the locale.
That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
(i.e., in utf-8) instead of a single-byte é.

But I think I know why "make check" was failing:

[larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
sv_SE.ISO-8859-1
[larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
en_US.UTF-8

The tests that were failing all talked about "chmod" and stuff, so I'm
guessing they were from a sub shell, and my system is apparently forcing
all new shells to use UTF-8...  And that was because I set the variables
in .bashrc.  I've now made them be 8859 also in sub-shells, but
unfortunately that doesn't help (it was a long shot, anyway -- these
aren't interactive shells, so .bashrc shouldn't be consulted).
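
A minimal sketch of the inheritance point above, assuming a POSIX sh: a
child shell started non-interactively sees whatever LANG its parent
exported, without reading an interactive startup file.  The helper name
child_lang is made up for illustration; the locale name is the one used
in this thread and need not be installed for the variable to be passed on.

```shell
# Report the LANG value that a freshly started, non-interactive child
# shell sees (hypothetical helper, for illustration only).
child_lang() {
  sh -c 'echo "${LANG:-unset}"'
}

# An exported value reaches the child.  The parentheses run this in a
# subshell so the setting does not leak into the calling shell.
( export LANG=sv_SE.ISO-8859-1; child_lang )   # prints sv_SE.ISO-8859-1
```

If a startup file did reset LANG, the same call made from an interactive
shell would be the place to notice it.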

make check:

>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))

This time over, the directory is "fóo" (in latin-1), and that looks like
Emacs is trying to find the utf-8 version of the file name.

So it looks like the patch set has problems, and needs further fixes.
(Or "make check" has some problems here, since Emacs otherwise seems to
work fine.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 11:06:02 GMT) Full text and rfc822 format available.

Message #42 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 14:05:26 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 12:55:55 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> All the tools under Linux are so utf-8-focused these days...  let's
> >> see...  I first, under a utf-8 locale created the directory "émacs",
> >> then converted it to 8859-1:
> >
> > No, please create the directory with non-ASCII name _after_ switching
> > the locale to Latin-1.
> 
> Shouldn't the result be the same?  I.e., a name with iso-8859-1 name?

No, because the Linux file I/O APIs are encoding-agnostic, they will
(AFAIK) create the directory with a name that is the exact byte stream
that you type at the mkdir command (or at the Emacs make-directory).
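
To make the "exact byte stream" point concrete, here is a small sketch
that prints the bytes behind the same visible name in the two encodings.
The hex_bytes helper is hypothetical (not part of Emacs or coreutils);
it only assumes a POSIX sh with od.

```shell
# Print the bytes of standard input as space-separated hex pairs.
hex_bytes() {
  # Unquoted expansion on purpose: word splitting normalizes od's spacing.
  set -- $(od -An -v -tx1)
  echo "$*"
}

printf '\303\251macs' | hex_bytes   # "émacs" in UTF-8:   c3 a9 6d 61 63 73
printf '\351macs'     | hex_bytes   # "émacs" in Latin-1: e9 6d 61 63 73
```

A directory created from the first byte sequence and one created from the
second are two different names as far as the kernel is concerned.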

> The reason I did it in this convoluted way was just that I couldn't
> convince my system to make an 8859 name even after changing the locale.
> That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
> (i.e., in utf-8) instead of a single-byte é.

Try doing this in Emacs, and use one of the Latin input methods if the
keyboard doesn't cooperate.

> But I think I know why "make check" was failing:
> 
> [larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
> sv_SE.ISO-8859-1
> [larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
> en_US.UTF-8

I don't understand this: 2 identical commands one after the other
yield different results?

> The tests that were failing all talked about "chmod" and stuff, so I'm
> guessing they were from a sub shell, and my system is apparently forcing
> all new shells to use UTF-8...

Really?  So there's no way to change the locale to something
non UTF-8?

> make check:
> 
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))
> 
> This time over, the directory is "fóo" (in latin-1), and that looks like
> Emacs is trying to find the utf-8 version of the file name.

If that's the case, then we lack ENCODE_FILE (or more generally don't
encode a file name) somewhere.

> So it looks like the patch set has problems, and needs further fixes.
> (Or "make check" has some problems here, since Emacs otherwise seems to
> work fine.)

We could also just install the changes and wait for bug reports, on
the assumption that the problems you see aren't real.  Your call.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 11:28:01 GMT) Full text and rfc822 format available.

Message #45 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 13:27:28 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> But I think I know why "make check" was failing:
>> 
>> [larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
>> sv_SE.ISO-8859-1
>> [larsi <at> stories ~/src/emacs/trunk]$ echo $LANG
>> en_US.UTF-8
>
> I don't understand this: 2 identical commands one after the other
> yield different results?

Sorry, there was a "bash" started in between there.

>> This time over, the directory is "fóo" (in latin-1), and that looks like
>> Emacs is trying to find the utf-8 version of the file name.
>
> If that's the case, then we lack ENCODE_FILE (or more generally don't
> encode a file name) somewhere.

After instrumenting bytecomp (i.e., adding a bunch of messages), I see
what function is actually failing.  With this in byte-compile-file:

                  (message "foo2: %S" (prin1-to-string tempfile))
		  (unless (= temp-modes desired-modes)
		    (set-file-modes tempfile desired-modes 'nofollow))
                  (message "foo1: %S" (prin1-to-string tempfile))

I get this output:

make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
  ELC      lisp/eshell/eshell-tests.elc
foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1

So it's created a tempfile, tagged with the correct charset (I had no
idea that that's how it worked), but decoded, and then set-file-modes
interprets that as an UTF-8 file name.

So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
same problem.

That weird file name (decoded and tagged with a charset text parameter)
comes from make-temp-file -- everything seems to be OK before that.
target-file is:

foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""

which seems to be correct, but

		       (tempfile
			(make-temp-file (expand-file-name target-file)))

is

"#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

and then things fail.  Which makes me wonder why building Emacs at all
works if it's such a fundamental problem...  Just to check whether my
system is switching the LANG back to utf-8:

          (message "foo: %S" (getenv "LC_ALL"))

in byte-compile-file says

foo: "sv_SE.ISO-8859-1"

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 12:25:01 GMT) Full text and rfc822 format available.

Message #48 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 15:24:14 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 13:27:28 +0200
> 
> make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
>   ELC      lisp/eshell/eshell-tests.elc
> foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
> make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1
> 
> So it's created a tempfile, tagged with the correct charset (I had no
> idea that that's how it worked), but decoded, and then set-file-modes
> interprets that as an UTF-8 file name.
> 
> So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
> same problem.

There be dragons ;-)

The problematic aspect of debugging these problems is that what you
see is not always what's there, due to display and decoding/encoding
operations by both Emacs and the display software you have on your
system (which drives the terminal).

In particular, strings inside Emacs are always in UTF-8-compatible
encoding, so the fact you get UTF-8 in *Messages* doesn't prove
anything.  What we need is to find 2 types of possible problems:

  . raw bytes from Latin-1 encoding inside Emacs buffers or strings
    that are supposed to be decoded
  . UTF-8 encoded (instead of Latin-1 encoded) characters passed to
    libc functions

So if you found that the problem reveals itself in set-file-modes,
let's see what happens there.  The relevant code is this:

  char *fname = SSDATA (ENCODE_FILE (absname));
  mode_t imode = XFIXNUM (mode) & 07777;
  if (fchmodat (AT_FDCWD, fname, imode, nofollow) != 0)
    report_file_error ("Doing chmod", absname);

Please either run this under GDB, or add printf's, to show the byte
sequences of 'absname' and of 'fname'.  The former should be in UTF-8
(so you should see 0xC3 and 0xB3 for the ó character), the latter
should be in Latin-1 (so you should see 0xF3 for the same letter).
This should give us some hints wrt where to look for the cause of the
problem.
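
As a reading aid, the error messages in this thread print bytes as octal
escapes (\303\263, \363) while the values above are given in hex.  The
sketch below converts between the two, relying on the POSIX printf rule
that a numeric argument with a leading 0 is read as octal.  The helper
name oct2hex is an assumption made for this example.

```shell
# Convert octal byte values (as shown in Emacs error messages) to hex.
oct2hex() {
  out=
  for o in "$@"; do
    # "0$o" makes printf parse the digits as an octal number.
    out="$out${out:+ }$(printf '%02x' "0$o")"
  done
  echo "$out"
}

oct2hex 303 263   # UTF-8 encoding of ó:   c3 b3
oct2hex 363       # Latin-1 encoding of ó: f3
```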

> That weird file name (decoded and tagged with a charset text parameter)
> comes from make-temp-file -- everything seems to be OK before that.
> target-file is:
> 
> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
> 
> which seems to be correct,

Where does the "foo:" printout come from?  I wouldn't expect to see
Latin-1 encoded strings inside Emacs, not normally anyway.

> but
> 
> 		       (tempfile
> 			(make-temp-file (expand-file-name target-file)))
> 
> is
> 
> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

I see nothing wrong here: this is how decoding works in Emacs.  And
again, how did you produce this string?  As I explained above, the
details of how you display these strings matter in this case.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 12:34:01 GMT) Full text and rfc822 format available.

Message #51 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 14:33:08 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> So if you found that the problem reveals itself in set-file-modes,
> let's see what happens there.  The relevant code is this:

Yeah, I don't think that function is the problem in itself, but I don't
know where the problem originates either.

>> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
>> 
>> which seems to be correct,
>
> Where does the "foo:" printout come from?  I wouldn't expect to see
> Latin-1 encoded strings inside Emacs, not normally anyway.

I just added a bunch of

          (message "foo: %S" variable)

here and there in byte-compile-file to watch how the passed-in string is
transformed. 

>> 		       (tempfile
>> 			(make-temp-file (expand-file-name target-file)))
>> 
>> is
>> 
>> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>
> I see nothing wrong here: this is how decoding works in Emacs.  And
> again, how did you produce this string?  As I explained above, the
> details of how you display these strings matter in this case.

Same way as above.

The file name is on the "f\\363o/test" form until make-temp-name, and
then it turns into a different string with a text property.  But I don't
know how much this is an artefact of how Emacs prints these things and
how much it's actually, er...  actual.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 12:40:02 GMT) Full text and rfc822 format available.

Message #54 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 14:39:07 +0200

Another confusing data point.  If I say "make" in the test directory, I
get:

foo 1: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
foo 2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcGvbK3T\" 0 65 (charset iso-8859-1))"

If I just say "make" in the main directory, I get this:

foo 1: "\"/home/larsi/src/emacs/f�o/lisp/dos-w32.elc\""
foo 2: "\"/home/larsi/src/emacs/fóo/lisp/dos-w32.elcXgukAl\""

Or, if that doesn't survive emailing, here's an image:

[Message part 2 (image/png, inline)]

Note -- no text properties, and not represented as "f\363o".

*scratches head*

So is this a problem with how ert calls the byte compiler after all?

This is with

diff --git a/lisp/emacs-lisp/bytecomp.el b/lisp/emacs-lisp/bytecomp.el
index 966990bac9..07448033ac 100644
--- a/lisp/emacs-lisp/bytecomp.el
+++ b/lisp/emacs-lisp/bytecomp.el
@@ -1990,6 +1990,7 @@ byte-compile-file
 	(with-current-buffer output-buffer
 	  (goto-char (point-max))
 	  (insert "\n")			; aaah, unix.
+          (message "foo 1: %S" (prin1-to-string (expand-file-name target-file)))
 	  (if (file-writable-p target-file)
 	      ;; We must disable any code conversion here.
 	      (progn
@@ -2007,6 +2008,7 @@ byte-compile-file
 			(cons (lambda () (ignore-errors
 					   (delete-file tempfile)))
 			      kill-emacs-hook)))
+		  (message "foo 2: %S" (prin1-to-string tempfile))
 		  (unless (= temp-modes desired-modes)
 		    (set-file-modes tempfile desired-modes 'nofollow))
 		  (write-region (point-min) (point-max) tempfile nil 1)


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 12:42:01 GMT) Full text and rfc822 format available.

Message #57 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 15:41:16 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 14:33:08 +0200
> 
> The file name is on the "f\\363o/test" form until make-temp-name

That shouldn't happen.  It probably means we lack a DECODE_FILE
somewhere.  File names inside Emacs should always be decoded into
UTF-8.

> and
> then it turns into a different string with a text property.  But I don't
> know how much this is an artefact of how Emacs prints these things and
> how much it's actually, er...  actual.

The only way to know is to add printf's or look in GDB.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 12:46:02 GMT) Full text and rfc822 format available.

Message #60 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 15:45:18 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 14:39:07 +0200
> 
> So is this a problem with how ert calls the byte compiler after all?

I don't think so, but I'm not sure.  It could be some shenanigans of
expand-file-name, for example: it has its own ideas for when to
produce a unibyte string and when a multibyte string.

Again, the fact that "foo 1" displays a unibyte undecoded file name
sounds wrong to me.  Is target-file also a unibyte Latin-1 string?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 14:19:02 GMT) Full text and rfc822 format available.

Message #63 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 16:18:32 +0200

I'm just poking around to see what's different between the way the files
are compiled in the test directory and the lisp directory, because they
should either both fail or not.

So here's how "make" in test does it:

EMACSLOADPATH= LC_ALL=C EMACS_TEST_DIRECTORY=/home/larsi/src/emacs/f�o/test  "../src/emacs" --module-assertions --no-init-file --no-site-file --no-site-lisp -L ":."  --batch -f batch-byte-compile lisp/eshell/eshell-tests.el

Here's how "make" in lisp does it:

EMACSLOADPATH= '../src/emacs' -batch --no-site-file --no-site-lisp --eval '(setq load-prefer-newer t)'  -f batch-byte-compile emacs-lisp/bytecomp.el

And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
successfully.
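
A short sketch of why the prefix matters, assuming the usual POSIX
precedence of locale variables (LC_ALL over LC_CTYPE over LANG): an
LC_ALL=C on the command line overrides a Latin-1 LANG from the
surrounding shell.  The function name effective_ctype is hypothetical.

```shell
# Print the locale name that governs character encoding, following the
# POSIX precedence: LC_ALL, then LC_CTYPE, then LANG, then "C".
effective_ctype() {
  echo "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# LANG asks for Latin-1, but LC_ALL=C (as in the test command line) wins.
( LANG=sv_SE.ISO-8859-1; LC_ALL=C; effective_ctype )               # C
# Without LC_ALL and LC_CTYPE, LANG decides.
( unset LC_ALL LC_CTYPE; LANG=sv_SE.ISO-8859-1; effective_ctype )  # sv_SE.ISO-8859-1
```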

*phew*

Hm...  in fact, everything compiles successfully without LC_ALL?

However, when the tests run (in the latin-1 environment), 11 test files fail:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4200 tests, 4097 results as expected, 29 unexpected, 74 skipped
1 files did not contain any tests:
  src/emacs-module-tests.log
11 files contained unexpected results:
  src/regex-emacs-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/vc/diff-mode-tests.log
  lisp/time-stamp-tests.log
  lisp/net/shr-tests.log
  lisp/gnus/mml-sec-tests.log
  lisp/epg-tests.log
  lisp/emacs-lisp/package-tests.log
  lisp/emacs-lisp/faceup-tests/faceup-test-files.log
  lisp/cedet/semantic-utest-ia.log
  lib-src/emacsclient-tests.log

As a comparison, removing the LC_ALL in an utf-8 environment (with a
pure-ascii path) gives me:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4231 tests, 4150 results as expected, 6 unexpected, 75 skipped
6 files contained unexpected results:
  src/emacs-module-tests.log
  src/callint-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/subr-tests.log
  lisp/files-tests.log
  lisp/emacs-lisp/gv-tests.log

The bzr test fails because of the brz/bzr thing, but the LC_ALL is
apparently needed for the other five things.

So: In conclusion, I think Glenn's patch needs more work before
applying.  :-)  But at least we now know that it breaks, and why (well,
for some of it).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 14:28:02 GMT) Full text and rfc822 format available.

Message #66 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 16:27:30 +0200

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
> successfully.

Oh, wow.  Apparently nobody is using non-ASCII in their Emacs paths?  I
just did a "mv trunk góo" on my laptop (UTF-8 environment), nothing
altered from out-of-the-box on Debian bullseye, and make check:

>>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1

for all the files.

So the LC_ALL=C thing in the compilation phase is just...  wrong?
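
The escaped bytes in that error (\303\203\302\263) are what you get if
the UTF-8 bytes of "ó" (\303\263) are each treated as a Latin-1
character and then encoded to UTF-8 a second time.  The sketch below
reproduces that pattern in plain POSIX sh; it is an observation about
the bytes, not a confirmed diagnosis of where Emacs does this, and the
function name latin1_to_utf8_octal is made up for the example.

```shell
# Re-encode stdin from Latin-1 to UTF-8.  ASCII bytes are printed as is;
# the resulting non-ASCII bytes are shown as backslash-octal escapes,
# the same notation Emacs uses in its error messages.
latin1_to_utf8_octal() {
  od -An -v -tu1 | tr -s ' ' '\n' | while read -r b; do
    [ -n "$b" ] || continue
    if [ "$b" -lt 128 ]; then
      printf "\\$(printf '%03o' "$b")"
    else
      # A Latin-1 byte is code point U+0080..U+00FF: two UTF-8 bytes.
      printf '\\%03o\\%03o' $(( 192 | (b >> 6) )) $(( 128 | (b & 63) ))
    fi
  done
  echo
}

# "góo" already in UTF-8, wrongly treated as Latin-1 and encoded again:
printf 'g\303\263o' | latin1_to_utf8_octal   # g\303\203\302\263o
```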

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 14:47:01 GMT) Full text and rfc822 format available.

Message #69 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 17:46:14 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 16:27:30 +0200
> 
> >>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
> make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1
> 
> for all the files.
> 
> So the LC_ALL=C thing in the compilation phase is just...  wrong?

It's probably not TRT when the directory is non-ASCII.  But note that
you can say

   make check TEST_LOCALE=<whatever>

Does it help to use the locale you have set?

"git log -L" indicates that the default setting of TEST_LOCALE=C was
introduced in commit 4874f0b.  It would be interesting to see what the
tests mentioned in the log message of that commit yield if the locale
is not C.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 14:55:01 GMT) Full text and rfc822 format available.

Message #72 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 16:54:46 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> It's probably not TRT when the directory is non-ASCII.

Sure.

> But note that you can say
>
>    make check TEST_LOCALE=<whatever>
>
> Does it help to use the locale you have set?

That allows the files to be compiled, but some tests fail:

Files examined: 305
Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
5 files contained unexpected results:
  src/emacs-module-tests.log
  src/callint-tests.log
  lisp/subr-tests.log
  lisp/net/tramp-archive-tests.log
  lisp/emacs-lisp/gv-tests.log

> "git log -L" indicates that the default setting of TEST_LOCALE=C was
> introduced in commit 4874f0b.  It would be interesting to see what the
> tests mentioned in the log message of that commit yield if the locale
> is not C.

Hm...  seems like that commit just made it optional.  Looks like the
LC_ALL=C has been there from the very beginning, which means that in all
these years, nobody has tried "make check" with non-ASCII chars in their
paths.  :-)

commit d221e7808c01fdc9234734f95ecf49e902085ddd
Author:     Christian Ohler <ohler <at> gnu.org>
AuthorDate: Thu Jan 13 03:08:24 2011 +1100
Commit:     Christian Ohler <ohler <at> gnu.org>
CommitDate: Thu Jan 13 03:08:24 2011 +1100

    Add ERT, a tool for automated testing in Emacs Lisp.
    
    * Makefile.in, configure.in, doc/misc/Makefile.in, doc/misc/makefile.w32-in:
    Add ERT.  Make "make check" run tests in test/automated.
    
    * doc/misc/ert.texi, lisp/emacs-lisp/ert.el, lisp/emacs-lisp/ert-x.el:
    New files.
    
    * test/automated: New directory.

diff --git a/test/automated/Makefile.in b/test/automated/Makefile.in
--- /dev/null
+++ b/test/automated/Makefile.in
@@ -0,0 +47,2 @@
+# The actual Emacs command run in the targets below.
+emacs = EMACSLOADPATH=$(lispsrc):$(test) LC_ALL=C $(EMACS) $(EMACSOPT)


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Fri, 11 Sep 2020 15:12:02 GMT) Full text and rfc822 format available.

Message #75 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Fri, 11 Sep 2020 18:11:32 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org,  15803 <at> debbugs.gnu.org
> Date: Fri, 11 Sep 2020 16:54:46 +0200
> 
> >    make check TEST_LOCALE=<whatever>
> >
> > Does it help to use the locale you have set?
> 
> That allows the files to be compiled, but some tests fail:
> 
> Files examined: 305
> Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
> 5 files contained unexpected results:
>   src/emacs-module-tests.log
>   src/callint-tests.log
>   lisp/subr-tests.log
>   lisp/net/tramp-archive-tests.log
>   lisp/emacs-lisp/gv-tests.log

Maybe these tests expect some special locale.  For example,
emacs-module-tests could expect UTF-8, since we don't support
non-UTF-8 strings in modules.

Anyway, I think if this is down to a couple of tests, we can install
the changes, as the problems they uncover are elsewhere.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Sat, 12 Sep 2020 08:48:01 GMT) Full text and rfc822 format available.

Message #78 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Michael Albinus <michael.albinus <at> gmx.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, Lars Ingebrigtsen <larsi <at> gnus.org>, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Sat, 12 Sep 2020 10:47:21 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> That allows the files to be compiled, but some tests fail:
>>
>> Files examined: 305
>> Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
>> 5 files contained unexpected results:
>>   src/emacs-module-tests.log
>>   src/callint-tests.log
>>   lisp/subr-tests.log
>>   lisp/net/tramp-archive-tests.log
>>   lisp/emacs-lisp/gv-tests.log
>
> Maybe these tests expect some special locale.  For example,
> emacs-module-tests could expect UTF-8, since we don't support
> non-UTF-8 strings in modules.

UTF8 is also required for tramp-archive-tests, IIRC (not checked actually).

> Anyway, I think if this is down to a couple of tests, we can install
> the changes, as the problems they uncover are elsewhere.

Agreed. If needed, I could adapt tramp-archive-tests. I cannot speak for
the other tests.

Best regards, Michael.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15803; Package emacs. (Sat, 12 Sep 2020 11:23:02 GMT) Full text and rfc822 format available.

Message #81 received at 15803 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rgm <at> gnu.org, 15803 <at> debbugs.gnu.org
Subject: Re: bug#15803: default-file-name-coding-system: utf-8 better than
 latin-1 these days?
Date: Sat, 12 Sep 2020 13:21:49 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Maybe these tests expect some special locale.  For example,
> emacs-module-tests could expect UTF-8, since we don't support
> non-UTF-8 strings in modules.
>
> Anyway, I think if this is down to a couple of tests, we can install
> the changes, as the problems they uncover are elsewhere.

Yeah, that's true -- since "make check" has seemingly never worked well
with a non-ASCII path, then the patch doesn't really regress anything
much (although the number of tests that fail with non-ASCII paths
increases).

OK, I'll apply the patch (after test-compiling on a couple systems), and
open a new bug report for the non-ASCII path/"make check" thing.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) fixed. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 12 Sep 2020 11:38:01 GMT) Full text and rfc822 format available.

bug marked as fixed in version 28.1, send any further explanations to 15803 <at> debbugs.gnu.org and Glenn Morris <rgm <at> gnu.org> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 12 Sep 2020 11:38:01 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 11 Oct 2020 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 281 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
