GNU bug report logs - #30935
gzip -l reports wrong size for decompressed files larger than 4GB


Package: gzip

Reported by: Wolfgang Formann <wformann <at> arcor.de>

Date: Sun, 25 Mar 2018 13:31:03 UTC

Severity: normal

Merged with 17804, 29089, 30936, 38766, 42965, 48424, 52227

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30935 in the body.
You can then email your comments to 30935 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gzip <at> gnu.org:
bug#30935; Package gzip. (Sun, 25 Mar 2018 13:31:03 GMT)

Acknowledgement sent to Wolfgang Formann <wformann <at> arcor.de>:
New bug report received and forwarded. Copy sent to bug-gzip <at> gnu.org. (Sun, 25 Mar 2018 13:31:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Wolfgang Formann <wformann <at> arcor.de>
To: bug-gzip <at> gnu.org
Subject: gzip -l reports wrong size for decompressed files larger than 4GB
Date: Sun, 25 Mar 2018 10:42:42 +0200
Hello!

I am using gzip 1.6 from openSUSE Leap 42.3 with the latest patches:

$ file /usr/bin/gzip
/usr/bin/gzip: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.0.0, BuildID[sha1]=7103d56e17e6f81a52db927e393dce601c3af0e1, stripped

A compressed file is available at https://data.dnb.de/opendata/GND.rdf.gz; it is 1,232,465,678 bytes in size. Uncompressed, it is 19,465,374,298 bytes.

The problem is:
$ gzip -l GND.rdf.gz
         compressed        uncompressed  ratio uncompressed_name
         1232465678          2285505114  46.1% GND.rdf

This number, 2285505114, is actually the lower 32 bits of the real size (about 19 GB):
$ echo "19465374298-16*1024*1024*1024" | bc
2285505114
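
The same check can be sketched in Python. This is only an illustration: the variable names are made up here, and it assumes the reported value is simply the true size reduced modulo 2^32.

```python
# True uncompressed size quoted in the report, in bytes.
real_size = 19_465_374_298

# Keep only the low 32 bits, i.e. reduce modulo 2**32.
reported = real_size % 2**32

# Masking with 0xFFFFFFFF is an equivalent way to write the same thing.
assert reported == real_size & 0xFFFFFFFF

print(reported)  # 2285505114
```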

Such behaviour is acceptable for 32-bit software, but a 64-bit build should show the correct number.

Thanks
Wolfgang





Information forwarded to bug-gzip <at> gnu.org:
bug#30935; Package gzip. (Sun, 25 Mar 2018 21:07:02 GMT)

Message #8 received at 30935 <at> debbugs.gnu.org:

From: Mark Adler <madler <at> alumni.caltech.edu>
To: Wolfgang Formann <wformann <at> arcor.de>
Cc: 30935 <at> debbugs.gnu.org
Subject: Re: bug#30935: gzip -l reports wrong size for decompressed files
 larger than 4GB
Date: Sun, 25 Mar 2018 14:05:52 -0700
Wolfgang,

The gzip format stores only the low 32 bits of the uncompressed length as the last four bytes of the stream, so it is not possible to show the correct number. At least not without decompressing the whole thing.

There are two other ways that the displayed uncompressed size can be incorrect, even for small files: a) if there is more than one gzip member in the gzip stream, only the uncompressed size of the last member will be shown; b) if there are junk bytes after the end of the gzip stream, the last four junk bytes will be interpreted as the length.

In short, the reported length is informational at best, and should not be trusted if the information is important. The purpose of storing the length modulo 2^32 in the trailer is to provide an additional integrity check along with the CRC. However, it was also used for gzip -l, which was perhaps a mistake.
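
The three failure modes can be reproduced with a small Python sketch. It is an illustration only: `trailer_isize` is a hypothetical helper, not part of gzip, and it assumes that a listing based on the trailer simply interprets the last four bytes of the file as a little-endian 32-bit integer.

```python
import gzip
import struct


def trailer_isize(gz_bytes: bytes) -> int:
    """Interpret the last four bytes as a little-endian unsigned 32-bit length."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


single = gzip.compress(b"a" * 1000)
# Two gzip members concatenated into one stream.
two_members = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 10)
# A valid member followed by four junk bytes.
with_junk = single + b"\x01\x02\x03\x04"

print(trailer_isize(single))       # 1000: correct for one small member
print(trailer_isize(two_members))  # 10: only the last member is counted
print(trailer_isize(with_junk))    # 67305985: the junk read as a length
```

A length of 4 GiB or more would wrap around in the same field, since only 32 bits are stored.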

You can get the actual decompressed length only by decompressing, and discarding the uncompressed data if you only want the length. You can either:

    gzip -dc file.gz | wc -c

or:

    pigz -lt file.gz

The latter will report the members of the gzip stream separately.
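
For illustration, the decompress-and-count approach can also be sketched in Python with per-member reporting. The function name `member_sizes` and its `chunk_size` parameter are assumptions made for this example; the sketch relies on the standard `zlib` module and raises `zlib.error` if junk follows the last member.

```python
import gzip
import io
import zlib


def member_sizes(fileobj, chunk_size=1 << 16):
    """Return the uncompressed size of each gzip member, discarding the data."""
    sizes = []
    decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip wrapper
    count = 0
    in_member = False
    buf = fileobj.read(chunk_size)
    while buf:
        # Count the decompressed bytes and throw them away.
        count += len(decomp.decompress(buf))
        in_member = True
        if decomp.eof:
            # This member is complete; leftover input belongs to the next one.
            sizes.append(count)
            buf = decomp.unused_data
            decomp = zlib.decompressobj(wbits=31)
            count = 0
            in_member = False
            if not buf:
                buf = fileobj.read(chunk_size)
        else:
            buf = fileobj.read(chunk_size)
    if in_member:
        raise ValueError("input ended in the middle of a gzip member")
    return sizes


data = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 10)
print(member_sizes(io.BytesIO(data)))  # [1000, 10]
```

The counts are ordinary Python integers, so they do not wrap at 2^32; the cost is that the whole stream has to be decompressed.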

Mark



Information forwarded to bug-gzip <at> gnu.org:
bug#30935; Package gzip. (Sun, 25 Mar 2018 22:59:03 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Wolfgang Formann <wformann <at> arcor.de>
To: bug-gzip <at> gnu.org
Subject: Re: bug#30935: gzip -l reports wrong size for decompressed files
 larger than 4GB
Date: Sun, 25 Mar 2018 23:25:42 +0200
Mark,

I accept that limitation. I would be happy if a statement like yours were included in the gzip man page.

Wolfgang


Information forwarded to bug-gzip <at> gnu.org:
bug#30935; Package gzip. (Mon, 26 Mar 2018 01:37:01 GMT)

Message #14 received at 30935 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Wolfgang Formann <wformann <at> arcor.de>, 30935 <at> debbugs.gnu.org
Subject: Re: bug#30935: gzip -l reports wrong size for decompressed files
 larger than 4GB
Date: Sun, 25 Mar 2018 18:36:34 -0700
Wolfgang Formann wrote:
> I accept that limitation. I would be happy if a statement like yours were
> included in the gzip man page.

It already is in the gzip manual, which is the main source of detailed info like that.




Information forwarded to bug-gzip <at> gnu.org:
bug#30935; Package gzip. (Mon, 26 Mar 2018 01:51:02 GMT)

Message #17 received at 30935 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Wolfgang Formann <wformann <at> arcor.de>, 30935 <at> debbugs.gnu.org
Subject: Re: bug#30935: gzip -l reports wrong size for decompressed files
 larger than 4GB
Date: Sun, 25 Mar 2018 18:49:46 -0700
tags 30935 notabug
close 30935
stop

On Sun, Mar 25, 2018 at 6:36 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Wolfgang Formann wrote:
>>
>> I accept that limitation. I would be happy if a statement like yours were
>> included in the gzip man page.
>
> It already is in the gzip manual, which is the main source of detailed info
> like that.

Marking this "issue" as closed in our bug tracker.




Merged 17804 29089 30935 30936 38766 42965 48424 52227. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 01 Dec 2021 23:34:01 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 13 Jan 2022 12:24:04 GMT)

This bug report was last modified 2 years and 100 days ago.
