GNU bug report logs - #30719
Progressively compressing piped input

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: gzip; Reported by: "Garreau\, Alexandre" <galex-713@HIDDEN>; dated Mon, 5 Mar 2018 21:20:02 UTC; Maintainer for gzip is bug-gzip@HIDDEN.

Message received at 30719 <at> debbugs.gnu.org:


Subject: Re: bug#30719: Progressively compressing piped input
From: Mark Adler <madler@HIDDEN>
Date: Mon, 5 Mar 2018 14:54:21 -0800
To: "Garreau, Alexandre" <galex-713@HIDDEN>
Cc: 30719 <at> debbugs.gnu.org


deflate has an inherent latency: it accumulates enough data to emit each
deflate block efficiently. You can deliberately flush (with zlib, not
gzip), but if you do that too frequently, e.g. each line, then you will
get lousy compression or even expansion.
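
A minimal sketch of the flush trade-off described above, in Python with
the standard zlib module (not the gzip program). It compresses the same
lines once with a single final flush and once with a Z_SYNC_FLUSH after
every line. The sample lines and the helper name compress_lines are made
up for illustration.

```python
import zlib

def compress_lines(lines, flush_each_line):
    """Compress lines as one gzip stream; optionally sync-flush after each line."""
    # wbits=31 selects the gzip container (header plus CRC/length trailer).
    comp = zlib.compressobj(level=9, wbits=31)
    out = bytearray()
    for line in lines:
        out += comp.compress(line)
        if flush_each_line:
            # Z_SYNC_FLUSH pushes out all pending data and ends the current
            # deflate block, so a reader can decode everything written so far.
            out += comp.flush(zlib.Z_SYNC_FLUSH)
    out += comp.flush()  # default Z_FINISH: final block and gzip trailer
    return bytes(out)

# Hypothetical repetitive log lines, loosely modelled on ping/date output.
lines = [
    b"2018-03-05 22:18:%02d+01:00 64 bytes from gnu.org: icmp_seq=1 ttl=52\n" % (i % 60)
    for i in range(1000)
]

batched = compress_lines(lines, flush_each_line=False)
flushed = compress_lines(lines, flush_each_line=True)

# Both results are complete gzip streams holding the same data.
assert zlib.decompress(batched, wbits=31) == b"".join(lines)
assert zlib.decompress(flushed, wbits=31) == b"".join(lines)

# A stream cut off right after a sync flush still decodes up to that point.
comp = zlib.compressobj(level=9, wbits=31)
partial = comp.compress(lines[0]) + comp.flush(zlib.Z_SYNC_FLUSH)
first_line = zlib.decompressobj(wbits=31).decompress(partial)

print(len(b"".join(lines)), len(batched), len(flushed))
```

On repetitive input like this, the stream flushed after every line
should come out noticeably larger than the batched one; what it buys is
that a reader can decode it up to the most recent flush.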

I wrote something called gzlog
(https://github.com/madler/zlib/blob/master/examples/gzlog.h), intended
to solve this problem. It can take a small amount of input, e.g. a line,
and update the output gzip file to be complete and valid after each
line, yet also get good compression in the long run. It does this by
writing the lines to the log.gz file effectively uncompressed (deflate
has a “stored” block type), until it has accumulated, say, 1 MB of data.
Then it goes back and compresses that uncompressed 1 MB, again always
leaving the gzip file in a valid state. gzlog also maintains something
like a journal, which allows gzlog to repair the gzip file if the last
operation was interrupted, e.g. by a power failure.
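
The general idea (keep recent data uncompressed but inside a valid gzip
file, and compress it once enough has accumulated) can be sketched in
Python roughly as follows. This is not gzlog's implementation: the class
name TinyGzLog is hypothetical, it swaps in a simpler mechanism
(concatenated gzip members, with the tail member rewritten on every
append), and it has no journal, so it cannot repair an interrupted
write.

```python
import gzip
import os
import tempfile

class TinyGzLog:
    """Hypothetical, simplified log writer; NOT gzlog's actual algorithm."""

    def __init__(self, path, threshold=1 << 20):
        self.path = path
        self.threshold = threshold  # compress once this many bytes are pending
        self.pending = b""          # recent data, held in a stored-only tail member
        self.compressed_end = 0     # file offset where the compressed members end
        open(path, "wb").close()    # start from an empty file

    def _write_tail(self, member):
        # Replace everything after the compressed members with `member`.
        with open(self.path, "r+b") as f:
            f.seek(self.compressed_end)
            f.write(member)
            f.truncate()
            f.flush()
            os.fsync(f.fileno())

    def append(self, data):
        self.pending += data
        if len(self.pending) >= self.threshold:
            # Enough has accumulated: replace the stored tail with a
            # properly compressed member and start a new, empty tail.
            member = gzip.compress(self.pending, compresslevel=9)
            self._write_tail(member)
            self.compressed_end += len(member)
            self.pending = b""
        else:
            # Level 0 emits deflate "stored" blocks: no compression,
            # but the result is still a valid gzip member.
            self._write_tail(gzip.compress(self.pending, compresslevel=0))

# Usage: after every append the file decompresses to all data written so far.
path = os.path.join(tempfile.mkdtemp(), "log.gz")
log = TinyGzLog(path, threshold=4096)
expected = b""
for i in range(500):
    line = b"line %d: 64 bytes from gnu.org\n" % i
    log.append(line)
    expected += line

with open(path, "rb") as f:
    restored = gzip.decompress(f.read())  # handles multiple gzip members
```

Rewriting the tail member on each append costs time proportional to the
pending data, which is one reason this is only a sketch of the idea.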

> On Mar 5, 2018, at 1:18 PM, Garreau, Alexandre <galex-713@HIDDEN> wrote:
>
> Hi,
>
> I have a script that logs very repetitive textual output (mostly the
> output of ping and date). To minimize disk usage, I thought to pipe it
> to gzip -9. Then I realized that the log, unlike before, remained
> empty, and recalled the GNU policy of “reading all input and only then
> outputting” to maximize overall speed at the expense of the
> decreasingly expensive memory.
>
> Yet I want to run that script all the time and be able to kill it
> uncleanly or just shut down without losing all its output (nor am I
> sure it is good practice anyway to keep everything in RAM until
> shutdown, considering I suppose gzip only keeps the compressed output
> in memory, discarding the then-useless input), and to “tail -f” the
> files it writes.
>
> I guess piping the whole output is the way to go to achieve optimal
> compression, since just gzipping each line/command output separately
> wouldn’t compress as much (the repetition occurs among the lines, not
> inside them). Yet would there be a way to obtain this maximal
> compression while having gzip output each time I stop giving it input
> (as I do every 30 seconds or so), without having to save the
> uncompressed file or recompress the whole file several times?
>
> I mean, it seems to me a good thing to wait until everything is
> compressed before outputting, rather than outputting as soon as
> possible, but isn’t there a way to trigger the output each time the
> input has been processed and there’s no more input for a certain
> amount of time (that is, ~30s)?
>
> Am I looking at something like this:
> #!/bin/bash
> while ping -c1 gnu.org ; do
>    date --rfc-3339=seconds
>    sleep 30
> done | gzip -9 -f | tee sample.log | zcat






Information forwarded to bug-gzip@HIDDEN:
bug#30719; Package gzip. Full text available.

Message received at submit <at> debbugs.gnu.org:


From: "Garreau\, Alexandre" <galex-713@HIDDEN>
To: bug-gzip@HIDDEN
Subject: Progressively compressing piped input
Date: Mon, 05 Mar 2018 22:18:53 +0100

Hi,

I have a script that logs very repetitive textual output (mostly the
output of ping and date). To minimize disk usage, I thought to pipe it
to gzip -9. Then I realized that the log, unlike before, remained empty,
and recalled the GNU policy of “reading all input and only then
outputting” to maximize overall speed at the expense of the decreasingly
expensive memory.

Yet I want to run that script all the time and be able to kill it
uncleanly or just shut down without losing all its output (nor am I sure
it is good practice anyway to keep everything in RAM until shutdown,
considering I suppose gzip only keeps the compressed output in memory,
discarding the then-useless input), and to “tail -f” the files it
writes.

I guess piping the whole output is the way to go to achieve optimal
compression, since just gzipping each line/command output separately
wouldn’t compress as much (the repetition occurs among the lines, not
inside them). Yet would there be a way to obtain this maximal
compression while having gzip output each time I stop giving it input
(as I do every 30 seconds or so), without having to save the
uncompressed file or recompress the whole file several times?

I mean, it seems to me a good thing to wait until everything is
compressed before outputting, rather than outputting as soon as
possible, but isn’t there a way to trigger the output each time the
input has been processed and there’s no more input for a certain amount
of time (that is, ~30s)?

Am I looking at something like this:

[Attachment: sample.sh (text/x-sh, inline). An example of what I am
trying to do, where I’d like regular output]

#!/bin/bash
while ping -c1 gnu.org ; do
    date --rfc-3339=seconds
    sleep 30
done | gzip -9 -f | tee sample.log | zcat





Acknowledgement sent to "Garreau\, Alexandre" <galex-713@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-gzip@HIDDEN. Full text available.
Report forwarded to bug-gzip@HIDDEN:
bug#30719; Package gzip. Full text available.
Last modified: Mon, 5 Mar 2018 23:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.