GNU bug report logs - #56469
29.0.50; Unibyte dir in directory_files_internal

Previous Next

Package: emacs;

Reported by: Stefan Monnier <monnier <at> iro.umontreal.ca>

Date: Sat, 9 Jul 2022 17:46:01 UTC

Severity: normal

Found in version 29.0.50

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 56469 in the body.
You can then email your comments to 56469 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sat, 09 Jul 2022 17:46:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 09 Jul 2022 17:46:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.0.50; Unibyte dir in directory_files_internal
Date: Sat, 09 Jul 2022 13:44:52 -0400
Package: Emacs
Version: 29.0.50


If you have a directory named "/tmp/\303a" with a file named "fée"
inside, then (directory-files "/tmp/\303a" 'full) is likely to return
a funny string which is multibyte but contains an invalid
utf-8 sequence (its bytes spell "/tmp/\303a/f\303\251e").
That strings seems to be printed as "/tmp/¡/fée" which corresponds
to "/tmp/\303\241/f\303\251e".

Such a string with an invalid UTF-8 sequence is handled quite graciously
by Emacs, so I wasn't able to get an actual crash out of it, but it's
still something we should avoid.

I suggest the patch below.  In a comment I suggest we don't try to use
unibyte strings when a multibyte string would work as well.  This is
because for those ASCII-only strings, it's cheaper to test bytes==chars
to (re)discover that they are ASCII-only (when they're multibyte) than
having to loop through the bytes (when they're unibyte).


        Stefan


diff --git a/src/dired.c b/src/dired.c
index 6bb8c2fcb9f..33ddfafd8e7 100644
--- a/src/dired.c
+++ b/src/dired.c
@@ -219,6 +219,13 @@ directory_files_internal (Lisp_Object directory, Lisp_Object full,
     }
 #endif
 
+  if (!NILP (full) && !STRING_MULTIBYTE (directory))
+    { /* We will be concatenating 'directory' with local file name.
+         We always decode local file names, so in order to safely concatenate
+         them we need 'directory' to be multibyte.  */
+      directory = Fstring_to_multibyte (directory);
+    }
+
   ptrdiff_t directory_nbytes = SBYTES (directory);
   re_match_object = Qt;
 
@@ -263,9 +270,10 @@ directory_files_internal (Lisp_Object directory, Lisp_Object full,
 	  ptrdiff_t name_nbytes = SBYTES (name);
 	  ptrdiff_t nbytes = directory_nbytes + needsep + name_nbytes;
 	  ptrdiff_t nchars = SCHARS (directory) + needsep + SCHARS (name);
-	  finalname = make_uninit_multibyte_string (nchars, nbytes);
-	  if (nchars == nbytes)
-	    STRING_SET_UNIBYTE (finalname);
+	  /* FIXME: Why not make them all multibyte?  */
+	  finalname = (nchars == nbytes)
+	              ? make_uninit_string (nchars, nbytes)
+	              : make_uninit_multibyte_string (nchars, nbytes);
 	  memcpy (SDATA (finalname), SDATA (directory), directory_nbytes);
 	  if (needsep)
 	    SSET (finalname, directory_nbytes, DIRECTORY_SEP);





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sat, 09 Jul 2022 18:18:02 GMT) Full text and rfc822 format available.

Message #8 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sat, 09 Jul 2022 21:17:22 +0300
> Date: Sat, 09 Jul 2022 13:44:52 -0400
> From:  Stefan Monnier via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> If you have a directory named "/tmp/\303a" with a file named "fée"
> inside, then (directory-files "/tmp/\303a" 'full) is likely to return
> a funny string which is multibyte but contains an invalid
> utf-8 sequence (its bytes spell "/tmp/\303a/f\303\251e").
> That strings seems to be printed as "/tmp/¡/fée" which corresponds
> to "/tmp/\303\241/f\303\251e".
> 
> Such a string with an invalid UTF-8 sequence is handled quite graciously
> by Emacs, so I wasn't able to get an actual crash out of it, but it's
> still something we should avoid.
> 
> I suggest the patch below.  In a comment I suggest we don't try to use
> unibyte strings when a multibyte string would work as well.  This is
> because for those ASCII-only strings, it's cheaper to test bytes==chars
> to (re)discover that they are ASCII-only (when they're multibyte) than
> having to loop through the bytes (when they're unibyte).

Please bootstrap Emacs in a directory with such a name, and if that
works, I'm okay with installing this change.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sat, 09 Jul 2022 18:21:02 GMT) Full text and rfc822 format available.

Message #11 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sat, 09 Jul 2022 14:20:37 -0400
>> I suggest the patch below.  In a comment I suggest we don't try to use
>> unibyte strings when a multibyte string would work as well.  This is
>> because for those ASCII-only strings, it's cheaper to test bytes==chars
>> to (re)discover that they are ASCII-only (when they're multibyte) than
>> having to loop through the bytes (when they're unibyte).
>
> Please bootstrap Emacs in a directory with such a name, and if that
> works, I'm okay with installing this change.

Just to clarify: by "this change" you refer to the change in the patch
or the change suggested in the comment?


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sat, 09 Jul 2022 18:54:01 GMT) Full text and rfc822 format available.

Message #14 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sat, 09 Jul 2022 21:53:25 +0300
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 56469 <at> debbugs.gnu.org
> Date: Sat, 09 Jul 2022 14:20:37 -0400
> 
> >> I suggest the patch below.  In a comment I suggest we don't try to use
> >> unibyte strings when a multibyte string would work as well.  This is
> >> because for those ASCII-only strings, it's cheaper to test bytes==chars
> >> to (re)discover that they are ASCII-only (when they're multibyte) than
> >> having to loop through the bytes (when they're unibyte).
> >
> > Please bootstrap Emacs in a directory with such a name, and if that
> > works, I'm okay with installing this change.
> 
> Just to clarify: by "this change" you refer to the change in the patch
> or the change suggested in the comment?

I meant the patch.  The comment I didn't understand at all.  It seemed
to be unrelated to the code and the change you were proposing.




Added tag(s) patch. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Sun, 10 Jul 2022 02:24:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 14:24:01 GMT) Full text and rfc822 format available.

Message #19 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 10:23:28 -0400
> Please bootstrap Emacs in a directory with such a name, and if that
> works, I'm okay with installing this change.

Pushed, thanks.

W.r.t to the comment, it's indeed unrelated to the patch (other than
the fact that it touches the same code).  The question is when we do:

	  finalname = (nchars == nbytes)
	              ? make_uninit_string (nbytes)
	              : make_uninit_multibyte_string (nchars, nbytes);

the actual bytes are "decoded" (i.e. in our internal UTF-8 encoding), so
(nchars == nbytes) checks whether its "pure ASCII" or not and if it's
pure ASCII we return a unibyte string.

Our file-name manipulation routines always consider unibyte-ASCII and
multibyte-ASCII as "equivalent", and indeed DECODE_FILE and ENCODE_FILE
take advantage of that so as to return their argument as-is when it's
all-ASCII so as to avoid allocating a string unnecessarily.

So in the above code snippet, when the string is all-ASCII, we actually
have a choice, and both a unibyte string and a multibyte string should
work.  Currently in that case we return a unibyte string, but I think in
such cases we're better off returning a multibyte string because the
subsequent "all-ASCII" test (that DE/ENCODE_FILE will perform when we
pass that filename to some further operation) will be more efficient
(it's a constant-time (nchars == nbytes) test whereas when the string is
unibyte it requires looking at each and every byte).

IOW, while it makes sense to return a "decoded unibyte" string from
DECODE_FILE in order to avoid an allocation, I don't think it makes
sense to return such a "decoded unibyte" string when we have to allocate
a new string anyway.


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 14:33:02 GMT) Full text and rfc822 format available.

Message #22 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 17:32:17 +0300
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 56469 <at> debbugs.gnu.org
> Date: Sun, 10 Jul 2022 10:23:28 -0400
> 
> W.r.t to the comment, it's indeed unrelated to the patch (other than
> the fact that it touches the same code).  The question is when we do:
> 
> 	  finalname = (nchars == nbytes)
> 	              ? make_uninit_string (nbytes)
> 	              : make_uninit_multibyte_string (nchars, nbytes);
> 
> the actual bytes are "decoded" (i.e. in our internal UTF-8 encoding), so
> (nchars == nbytes) checks whether its "pure ASCII" or not and if it's
> pure ASCII we return a unibyte string.

I don't think this is true, because early during startup we don't yet
have the coding-systems set up, and so the file names are unibyte and
undecoded.  So that place in dired.c doesn't only handle ASCII when it
sees that ncahrs == nbytes.

> So in the above code snippet, when the string is all-ASCII, we actually
> have a choice, and both a unibyte string and a multibyte string should
> work.  Currently in that case we return a unibyte string, but I think in
> such cases we're better off returning a multibyte string because the
> subsequent "all-ASCII" test (that DE/ENCODE_FILE will perform when we
> pass that filename to some further operation) will be more efficient
> (it's a constant-time (nchars == nbytes) test whereas when the string is
> unibyte it requires looking at each and every byte).
> 
> IOW, while it makes sense to return a "decoded unibyte" string from
> DECODE_FILE in order to avoid an allocation, I don't think it makes
> sense to return such a "decoded unibyte" string when we have to allocate
> a new string anyway.

I'm not necessarily opposed to decide that ASCII strings should be
multibyte, but doing so for file names will need careful auditing of
the sources with the startup process in mind.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 14:59:02 GMT) Full text and rfc822 format available.

Message #25 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 10:58:30 -0400
Eli Zaretskii [2022-07-10 17:32:17] wrote:

>> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
>> Cc: 56469 <at> debbugs.gnu.org
>> Date: Sun, 10 Jul 2022 10:23:28 -0400
>> 
>> W.r.t to the comment, it's indeed unrelated to the patch (other than
>> the fact that it touches the same code).  The question is when we do:
>> 
>> 	  finalname = (nchars == nbytes)
>> 	              ? make_uninit_string (nbytes)
>> 	              : make_uninit_multibyte_string (nchars, nbytes);
>> 
>> the actual bytes are "decoded" (i.e. in our internal UTF-8 encoding), so
>> (nchars == nbytes) checks whether its "pure ASCII" or not and if it's
>> pure ASCII we return a unibyte string.
>
> I don't think this is true, because early during startup we don't yet
> have the coding-systems set up, and so the file names are unibyte and
> undecoded.  So that place in dired.c doesn't only handle ASCII when it
> sees that ncahrs == nbytes.

Hmm... the early startup is actually not a worry here (according to my
tests `directory_files_internal` is first called when we get to
native-compile the macroexp/bytecomp, at which point all our coding
systems have been setup).

But indeed, if the file name coding system is something like `binary`,
DECODE_FILE will always return a unibyte string, so we may have non-ASCII
bytes when (nchars == nbytes).
Thanks, I'll update the comment accordingly.


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 15:08:02 GMT) Full text and rfc822 format available.

Message #28 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 18:07:24 +0300
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 56469 <at> debbugs.gnu.org
> Date: Sun, 10 Jul 2022 10:58:30 -0400
> 
> Eli Zaretskii [2022-07-10 17:32:17] wrote:
> 
> > I don't think this is true, because early during startup we don't yet
> > have the coding-systems set up, and so the file names are unibyte and
> > undecoded.  So that place in dired.c doesn't only handle ASCII when it
> > sees that ncahrs == nbytes.
> 
> Hmm... the early startup is actually not a worry here (according to my
> tests `directory_files_internal` is first called when we get to
> native-compile the macroexp/bytecomp, at which point all our coding
> systems have been setup).

That could be the situation _today_, but that's just sheer luck (or
lack thereof).  In general, all the file-handling code we have in
fileio.c and dired.c should be equally prepared to handle unibyte
non-ASCII file names and multibyte file names, because we may need
that any time.  When we make changes in Emacs, we shouldn't be worried
whether those changes could cause some dired.c code be called early on
during Emacs startup.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 15:20:02 GMT) Full text and rfc822 format available.

Message #31 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 11:19:22 -0400
> That could be the situation _today_, but that's just sheer luck (or
> lack thereof).  In general, all the file-handling code we have in
> fileio.c and dired.c should be equally prepared to handle unibyte
> non-ASCII file names and multibyte file names, because we may need
> that any time.  When we make changes in Emacs, we shouldn't be worried
> whether those changes could cause some dired.c code be called early on
> during Emacs startup.

Agreed.  In the updated comment I noted that we have a bug when we do

    (let ((file-name-coding-system 'binary))
      (directory-files "/tmp/été/" 'full)

because we'll be concatenating the multibyte string "/tmp/été/" with
the undecoded unibyte strings of the names of files in that directory.


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 15:43:01 GMT) Full text and rfc822 format available.

Message #34 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 18:41:38 +0300
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 56469 <at> debbugs.gnu.org
> Date: Sun, 10 Jul 2022 11:19:22 -0400
> 
> In the updated comment I noted that we have a bug when we do
> 
>     (let ((file-name-coding-system 'binary))
>       (directory-files "/tmp/été/" 'full)
> 
> because we'll be concatenating the multibyte string "/tmp/été/" with
> the undecoded unibyte strings of the names of files in that directory.

I don't think file-name related stuff can work in Emacs when
file-name-coding-system is set to an arbitrary value not reflecting
the reality.  Why would we want to support such cases?

(But I don't object to the comment.)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Sun, 10 Jul 2022 22:14:02 GMT) Full text and rfc822 format available.

Message #37 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Sun, 10 Jul 2022 18:13:39 -0400
> I don't think file-name related stuff can work in Emacs when
> file-name-coding-system is set to an arbitrary value not reflecting
> the reality.

I'd tend to agree (tho 'binary' does sound like a valid value which
should work in all cases under GNU/Linux).

I'm just a bit annoyed at the idea that ELisp code can end up
constructing a multibyte string whose bytes contain invalid utf-8
sequences, because I suspect we may end up with a core dump somewhere in
such a circumstance.


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Mon, 11 Jul 2022 02:29:01 GMT) Full text and rfc822 format available.

Message #40 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Mon, 11 Jul 2022 05:27:39 +0300
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 56469 <at> debbugs.gnu.org
> Date: Sun, 10 Jul 2022 18:13:39 -0400
> 
> > I don't think file-name related stuff can work in Emacs when
> > file-name-coding-system is set to an arbitrary value not reflecting
> > the reality.
> 
> I'd tend to agree (tho 'binary' does sound like a valid value which
> should work in all cases under GNU/Linux).

'binary' exactly means that you end up with unibyte strings and with
raw bytes in multibyte strings.

> I'm just a bit annoyed at the idea that ELisp code can end up
> constructing a multibyte string whose bytes contain invalid utf-8
> sequences, because I suspect we may end up with a core dump somewhere in
> such a circumstance.

Emacs should cope with this without dumping core, but the resulting
file names might not be readable by humans nor friendly to other
programs.




Removed tag(s) patch. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 05 Sep 2022 19:21:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56469; Package emacs. (Mon, 05 Sep 2022 19:22:02 GMT) Full text and rfc822 format available.

Message #45 received at 56469 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>, 56469 <at> debbugs.gnu.org
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Mon, 05 Sep 2022 21:21:36 +0200
Eli Zaretskii <eliz <at> gnu.org> writes:

> (But I don't object to the comment.)

Skimming this bug report lightly, it seems like the proposed patch was
applied, but then the discussion continued.  It's not clear to me
whether there's more to be done here -- should this report be closed?





Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Wed, 07 Sep 2022 13:33:02 GMT) Full text and rfc822 format available.

Notification sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
bug acknowledged by developer. (Wed, 07 Sep 2022 13:33:02 GMT) Full text and rfc822 format available.

Message #50 received at 56469-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 56469-done <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#56469: 29.0.50; Unibyte dir in directory_files_internal
Date: Wed, 07 Sep 2022 16:32:17 +0300
> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>,  56469 <at> debbugs.gnu.org
> Date: Mon, 05 Sep 2022 21:21:36 +0200
> 
> Skimming this bug report lightly, it seems like the proposed patch was
> applied, but then the discussion continued.  It's not clear to me
> whether there's more to be done here -- should this report be closed?

Yes, I think so.  Done.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 06 Oct 2022 11:24:15 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 196 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.