GNU bug report logs - #41657
md5sum: odd escaping for input filename \

Package: coreutils;

Reported by: Michael Coleman <mcolema5 <at> uoregon.edu>

Date: Tue, 2 Jun 2020 02:48:02 UTC

Severity: normal

Done: Bob Proulx <bob <at> proulx.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 41657 in the body.
You can then email your comments to 41657 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#41657; Package coreutils. (Tue, 02 Jun 2020 02:48:02 GMT)

Acknowledgement sent to Michael Coleman <mcolema5 <at> uoregon.edu>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 02 Jun 2020 02:48:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Michael Coleman <mcolema5 <at> uoregon.edu>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: md5sum: odd escaping for input filename \
Date: Tue, 2 Jun 2020 02:17:13 +0000
Apologies if this has already been fixed, but glancing at the source, probably not.

For version 8.22:

$ true > \\
$ md5sum \\
\d41d8cd98f00b204e9800998ecf8427e  \\
$ md5sum < \\
d41d8cd98f00b204e9800998ecf8427e  -

The checksum is not what I would expect, due to the leading backslash.  And in any case, the "\d" has no obvious interpretation.  Really, I can't imagine ever escaping the checksum.

(Yes, my users are a clever people.)

Cheers,
Mike

Michael Coleman (mcolema5 <at> uoregon.edu)
Computational Scientist
Research Advanced Computing Services
University of Oregon

Information forwarded to bug-coreutils <at> gnu.org:
bug#41657; Package coreutils. (Tue, 02 Jun 2020 03:53:01 GMT)

Message #8 received at 41657 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: Michael Coleman <mcolema5 <at> uoregon.edu>
Cc: 41657 <at> debbugs.gnu.org
Subject: Re: bug#41657: md5sum: odd escaping for input filename \
Date: Mon, 1 Jun 2020 21:52:39 -0600
Hello Michael,

Michael Coleman wrote:
> $ true > \\
> $ md5sum \\
> \d41d8cd98f00b204e9800998ecf8427e  \\
> $ md5sum < \\
> d41d8cd98f00b204e9800998ecf8427e  -

Thank you for the extremely good example!  It's excellent.

> The checksum is not what I would expect, due to the leading
> backslash.  And in any case, the "\d" has no obvious interpretation.
> Really, I can't imagine ever escaping the checksum.

As it turns out this is documented behavior.  Here is what the manual says:

     For each FILE, ‘md5sum’ outputs by default, the MD5 checksum, a
  space, a flag indicating binary or text input mode, and the file name.
  Binary mode is indicated with ‘*’, text mode with ‘ ’ (space).  Binary
  mode is the default on systems where it’s significant, otherwise text
  mode is the default.  Without ‘--zero’, if FILE contains a backslash or
  newline, the line is started with a backslash, and each problematic
  character in the file name is escaped with a backslash, making the
  output unambiguous even in the presence of arbitrary file names.  If
  FILE is omitted or specified as ‘-’, standard input is read.

Specifically it is this sentence.

  Without ‘--zero’, if FILE contains a backslash or newline, the line
  is started with a backslash, and each problematic character in the
  file name is escaped with a backslash, making the output unambiguous
  even in the presence of arbitrary file names.

And so the program is behaving as documented.  Which I am sure you will
not be happy about, since you filed this bug report about it.

Someone will correct me but I think the thinking is that the output of
md5sum is most useful when it can be checked with md5sum -c and
therefore the filename problem needed to be handled.  The trigger for
this escapes my memory.  But if you were to check the output with -c
then you would find this result with your test case.

  $ md5sum \\ | md5sum -c
  \: OK
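
To illustrate the round trip with a name that would otherwise break the one-record-per-line format, here is a minimal sketch. It assumes GNU md5sum; the scratch directory, the file name (the letter a, a newline, the letter b) and `sums.txt` are made up for the example and are not from the thread.

```shell
# Sketch only: assumes GNU md5sum; directory and file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A file whose name contains a newline: "a<newline>b".
name='a
b'
printf 'hello\n' > "$name"

# The newline in the name is written as \n and the line starts with a
# backslash, so the record still occupies exactly one line of sums.txt.
md5sum "$name" > sums.txt
cat sums.txt

# md5sum -c undoes the escaping and finds the file again.
md5sum -c sums.txt
```

Without the escaping, the record would span two lines and `md5sum -c` could not tell where the file name ends.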

And note that this applies to the other *sum programs too.

  The commands sha224sum, sha256sum, sha384sum and sha512sum compute
  checksums of various lengths (respectively 224, 256, 384 and 512
  bits), collectively known as the SHA-2 hashes. The usage and options
  of these commands are precisely the same as for md5sum and
  sha1sum. See md5sum invocation.

> (Yes, my users are a clever people.)

  I am so clever that sometimes I don't understand a single word of what I am saying -- Oscar Wilde

:-)

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#41657; Package coreutils. (Wed, 03 Jun 2020 00:23:01 GMT)

Message #11 received at 41657 <at> debbugs.gnu.org (full text, mbox):

From: Michael Coleman <mcolema5 <at> uoregon.edu>
To: Bob Proulx <bob <at> proulx.com>
Cc: "41657 <at> debbugs.gnu.org" <41657 <at> debbugs.gnu.org>
Subject: RE: bug#41657: md5sum: odd escaping for input filename \
Date: Tue, 2 Jun 2020 23:52:38 +0000
Hi Bob,

Thanks very much for your prompt reply.  Certainly, if this is documented behavior, it's not a bug.  I would have never thought to check the documentation as the behavior seems so strange.

If I understand correctly, the leading backslash in the first field is an indication that the second field is escaped.  (The first field never needs escapes, as far as I can see.)

Not sure I would have chosen this, but it can't really be changed now.  But, I suspect that almost no real shell script would deal with this escaping correctly.  Really, I'd be surprised if there were even one example.  If so, perhaps it could be changed without trouble.

In any case, thanks very much for your explanation.

Regards,
Mike

Information forwarded to bug-coreutils <at> gnu.org:
bug#41657; Package coreutils. (Wed, 24 Jun 2020 21:34:01 GMT)

Message #14 received at 41657 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: Michael Coleman <mcolema5 <at> uoregon.edu>
Cc: "41657 <at> debbugs.gnu.org" <41657 <at> debbugs.gnu.org>
Subject: Re: bug#41657: md5sum: odd escaping for input filename \
Date: Wed, 24 Jun 2020 15:33:49 -0600
close 41657
thanks

No one else has commented therefore I am closing the bug ticket.  But
the discussion may continue here.

Michael Coleman wrote:
> Thanks very much for your prompt reply.  Certainly, if this is
> documented behavior, it's not a bug.  I would have never thought to
> check the documentation as the behavior seems so strange.

I am not always so generous about documented behavior *never* being a
bug. :-)

> If I understand correctly, the leading backslash in the first field
> is an indication that the second field is escaped.  (The first field
> never needs escapes, as far as I can see.)

Right.  But that position was available to clue in md5sum and the other
utilities that the file name is an "unsafe" one and is escaped on that line.
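
As a rough sketch of what a script would have to do to honor that marker, the hypothetical helper below splits one line of default-format output into a hash and a decoded file name. The function name `parse_sum_line` and the variables `hash` and `name` are invented for this example. It only considers the `\\` and `\n` escapes named in the manual text quoted earlier, and it does not handle `--tag` style output.

```shell
# Hypothetical helper, not part of coreutils: parse one md5sum output line.
# Assumes the default format: HASH, one space, a mode flag (' ' or '*'), NAME.
# Results are left in the shell variables "hash" and "name".
parse_sum_line() {
    line=$1
    case $line in
        \\*)
            line=${line#?}                 # drop the marker backslash
            hash=${line%% *}               # everything before the first space
            name=${line#* ?}               # skip HASH, the space, and the flag
            # printf %b turns \\ into \ and \n into a newline; the trailing
            # dot protects a final newline from command substitution.
            name=$(printf '%b.' "$name")
            name=${name%.}
            ;;
        *)
            hash=${line%% *}
            name=${line#* ?}
            ;;
    esac
}

parse_sum_line '\d41d8cd98f00b204e9800998ecf8427e  \\'
printf 'hash=%s name=%s\n' "$hash" "$name"
```

For real verification work `md5sum -c` already does this decoding, so a helper like this is only of interest when the file names themselves are needed in a script.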

> Not sure I would have chosen this, but it can't really be changed
> now.  But, I suspect that almost no real shell script would deal
> with this escaping correctly.  Really, I'd be surprised if there
> were even one example.  If so, perhaps it could be changed without
> trouble.

Let's talk about the shell scripting part.  Why would this ever need
to be parsed in a shell script?  And if so then that is precisely
where it would need to be done due to the file name!

Your own example was a file name that consisted of a single
backslash.  Since the backslash is the shell escape character then
handling that in a shell script would require escaping it properly
with a second backslash.

I will suggest that the primary use for the *sum utility output is as
input to the same utility later to check the content for differences.
That's arguably the primary use of it.

There are also cases where we will want to use the *sum utilities on a
single file.  That's fine.  I think the problematic case here might be
a usage like this one.

  filename="\\"
  sum=$(md5sum "$filename" | awk '{print$1}')
  printf "%s\n" "$sum"
  \d41d8cd98f00b204e9800998ecf8427e

And then there is that extra backslash at the start of the hash.
Well, yes, that is unfortunate.  But in this case we already have the
filename in a variable and don't want the filename from md5sum.  This
is very similar to portability problems between different versions of
'wc' and other utilities too.  (Some 'wc' utils print leading spaces
and some do not.)
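
If the file name has to be passed as an argument anyway, one possible workaround is to strip the marker after extracting the first field. This is only a sketch, assuming GNU md5sum and a POSIX shell; the scratch directory is made up for the example.

```shell
# Sketch only: assumes GNU md5sum; the scratch directory is hypothetical.
workdir=$(mktemp -d)
filename="$workdir/\\"                # a file named "\" as in the thread
: > "$filename"                       # create it empty

sum=$(md5sum "$filename" | awk '{print $1}')
sum=${sum#\\}                         # drop the marker backslash, if present
printf '%s\n' "$sum"
```

The parameter expansion removes at most one leading backslash and leaves an unescaped hash untouched, so the same two lines work for ordinary file names too.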

As you already deduced, if md5sum is not given a file name then there
is no name to escape.  Standard input has no name, and therefore "-" is
used as a placeholder, as per tradition, and no leading backslash is
printed.

  filename="\\"
  sum=$(md5sum < "$filename" | awk '{print$1}')
  printf "%s\n" "$sum"
  d41d8cd98f00b204e9800998ecf8427e

And because this is a discussion I will note that the name is just one
of the possible names to a file.  Let's hard link it to a different
name.  And of course symbolic links are the same too.  A name is just
a pointer to a file.

  ln "$filename" foo
  md5sum foo
  d41d8cd98f00b204e9800998ecf8427e  foo

But I drift...

I think it likely you have already educated your people about the
problems, and that the solution was to read from stdin when the file
name is potentially untrusted "tainted" data.  (Programming languages
often refer to unknown, untrusted data as "tainted" for the purpose of
tracking which actions are safe upon it when taint checking is
enabled.)  Therefore if the name is unknown then it is safer to avoid
the name and use standard input.

And I suggest the same with other utilities such as 'wc' too.
Fortunately wc is not used to read back its own output.  Otherwise I am
sure someone would suggest that it would need the same escaping done
there too.  Example that thankfully does not actually exist:

  $ wc -l \\
  \0 \\

I am sure that if such a change were made it would result in
large, widespread breakage.  Let's hope that never happens.

Bob




bug closed, send any further explanations to 41657 <at> debbugs.gnu.org and Michael Coleman <mcolema5 <at> uoregon.edu> Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Wed, 24 Jun 2020 21:34:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#41657; Package coreutils. (Thu, 25 Jun 2020 16:39:01 GMT)

Message #19 received at 41657 <at> debbugs.gnu.org (full text, mbox):

From: Michael Coleman <mcolema5 <at> uoregon.edu>
To: Bob Proulx <bob <at> proulx.com>
Cc: "41657 <at> debbugs.gnu.org" <41657 <at> debbugs.gnu.org>
Subject: RE: bug#41657: md5sum: odd escaping for input filename \
Date: Thu, 25 Jun 2020 16:38:46 +0000
Not sure I have much useful to add, though per your example, it does seem surprising that the first output field can differ between

    md5sum "$filename"

and

    md5sum < "$filename"

Perhaps especially so since that only very rarely happens, and in all likelihood virtually no one knows of this behavior.

I do agree that the escape character usually won't make a difference.  It does make the checksum have a possibly variable length, though most code wouldn't care.  Some code (e.g., a call from a C program) could crash or clip the checksum, in which case comparison to checksums produced by other means (e.g., Python3 hashlib) will fail.  It wouldn't completely shock me if there's at least one latent security hole out there involving this. 
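
For code that hands the first field on to something else, a defensive check along these lines could catch the variable-length case before it causes trouble. This is a sketch; the function name `is_md5` is invented here, and it accepts only lowercase hex as md5sum prints it.

```shell
# Hypothetical check: succeed only for exactly 32 lowercase hex digits.
is_md5() {
    case $1 in
        *[!0123456789abcdef]*) return 1 ;;   # any non-hex character, e.g. "\"
    esac
    [ "${#1}" -eq 32 ]
}

is_md5 'd41d8cd98f00b204e9800998ecf8427e'  && echo "plain hash: accepted"
is_md5 '\d41d8cd98f00b204e9800998ecf8427e' || echo "escaped hash: rejected"
```

Rejecting the value is a design choice; a caller could instead strip the marker and retry, at the cost of silently accepting a line whose file name is escaped.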

I do sometimes do variations on this command to look for duplicate files, which I now realize fails for odd filenames.

    find . -type f -print0 | xargs -0 md5sum | sort
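
The manual text quoted earlier in this thread mentions `--zero`, which ends each record with NUL and turns the escaping off. A variant of the pipeline built on it might look like the sketch below. It assumes GNU find, xargs, sort and uniq, and a coreutils recent enough to have `md5sum --zero` (the 8.22 from the original report may not). The function name `dupes` is made up.

```shell
# Sketch only: list files under a directory whose MD5 sums collide.
# Output records are NUL-terminated: "HASH  NAME\0".
dupes() {
    find "$1" -type f -print0 |
        xargs -0 -r md5sum --zero |   # no escaping, no leading backslash
        sort -z |                     # group identical hashes together
        uniq -z -w32 -D               # compare the 32 hash chars, keep repeats
}

# For a quick look, turn the NULs into newlines (ambiguous again for names
# that contain newlines, so only do this for display).
dupes . | tr '\0' '\n'
```

Keeping the records NUL-terminated until the last step means backslashes and newlines in names never change the first field.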

It would have been nice if the quoting convention was more intuitive.  If you had asked me before all of this, I might have guessed that just backslash and newline were quoted in the filename as '\\' and '\n', and that the checksums themselves were not affected.  Seems more Unixy.

And though in GNU the man pages are not complete, this seems surprising enough to be worth mentioning.  As another possibility, perhaps this program and many more should sprout '-0' options.

Mike

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 24 Jul 2020 11:24:05 GMT)

This bug report was last modified 3 years and 267 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.