GNU bug report logs - #34133
Huge memory usage and output size when using "H" and "G"


Package: sed

Reported by: Hongxu Chen <leftcopy.chx <at> gmail.com>

Date: Sat, 19 Jan 2019 09:54:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.



Report forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sat, 19 Jan 2019 09:54:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Hongxu Chen <leftcopy.chx <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Sat, 19 Jan 2019 09:54:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Hongxu Chen <leftcopy.chx <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: Huge memory usage and output size when using "H" and "G"
Date: Sat, 19 Jan 2019 17:53:05 +0800
Hi,

    We found an issue related to the use of "H" and "G" for appending
to the hold space and pattern space.

    The attached input file has 30 lines of 80 columns, filled with 'a'.
My machine has 64 GB of memory and an equivalent amount of swap.

      # these two may eat up the memory
    sed 's/a/d/; G; H;' input
    sed '/b/d; G; H;' input

     # this is fine
    sed '/a/d; G; H;' input

    I learned from http://www.grymoire.com/Unix/Sed.html that 'G' appends
the hold space to the pattern space, and 'H' does the inverse.
    In the first two examples, the hold space is appended to the pattern
space, and then the content of the pattern space is appended back to the
hold space. With each additional input line, the two buffers double in
size; if the input file is big enough, sed will eventually eat up all
memory while flooding the output.
    We think this is a vulnerability, since it can exhaust memory within a
few seconds.
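The growth is easy to demonstrate at a tiny scale (an illustrative check,
not part of the original report; the line content is irrelevant):

```shell
# Each cycle runs G (append hold to pattern) then H (append pattern
# to hold), so the buffers roughly double with every input line.
printf 'a\na\na\na\na\n' | sed 'G;H' | wc -l   # 5 input lines -> 62 output lines
```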

Best Regards,
Hongxu
[input (application/octet-stream, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sat, 19 Jan 2019 09:58:01 GMT) Full text and rfc822 format available.

Message #8 received at 34133 <at> debbugs.gnu.org (full text, mbox):

From: Hongxu Chen <leftcopy.chx <at> gmail.com>
To: 34133 <at> debbugs.gnu.org
Subject: Duplicate of 34133
Date: Sat, 19 Jan 2019 17:57:39 +0800
Hi GNU sed maintainers,

    Sorry I mistakenly sent an email with empty content.
    Please close this issue and track #34133 instead.
    Thank you!

Best Regards,
Hongxu

Information forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sat, 19 Jan 2019 21:28:02 GMT) Full text and rfc822 format available.

Message #11 received at 34133 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Hongxu Chen <leftcopy.chx <at> gmail.com>, 34133 <at> debbugs.gnu.org
Subject: Re: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sat, 19 Jan 2019 14:27:30 -0700
tags 34133 notabug
close 34133
stop

Hello,

On 2019-01-19 2:53 a.m., Hongxu Chen wrote:
>      We found an issue that are relevant to use of "H" and "G" for appending
> hold space and pattern space.

It is an "issue" in the sense that your example does consume large
amounts of memory, but it is not a bug - this is how sed works.

>      The input file is attached which is a file of 30 lines and 80 columns
> filled with 'a'. And my memory is 64G with equivalent swap.
> 
>        # these two may eat up the memory
>      sed 's/a/d/; G; H;' input
>      sed '/b/d; G; H;' input


Let's simplify:
The "s/a/d/" does not change anything related to memory
(it changes a single letter "a" to "d" in the input), so I'll omit it.

The '/b/d' command is a no-op, because your input does not contain
the letter "b".

We're left with:
   sed 'G;H'
The length of each line also doesn't matter, so I'll use shorter lines.

Now observe the following:

$ printf "%s\n" 0 | sed 'G;H' | wc -l
2
$ printf "%s\n" 0 1 | sed 'G;H' | wc -l
6
$ printf "%s\n" 0 1 2 | sed 'G;H' | wc -l
14
$ printf "%s\n" 0 1 2 3 | sed 'G;H' | wc -l
30
$ printf "%s\n" 0 1 2 3 4 | sed 'G;H' | wc -l
62
$ printf "%s\n" 0 1 2 3 4 5 | sed 'G;H' | wc -l
126
$ printf "%s\n" 0 1 2 3 4 5 6 | sed 'G;H' | wc -l
254
$ printf "%s\n" 0 1 2 3 4 5 6 7 | sed 'G;H' | wc -l
510
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 | sed 'G;H' | wc -l
1022
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 | sed 'G;H' | wc -l
2046
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 | sed 'G;H' | wc -l
4094
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 11 | sed 'G;H' | wc -l
8190
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 11 12 | sed 'G;H' | wc -l
16382

Notice the trend?
The number of lines (and by proxy: size of buffer and memory usage)
is exponential.

With 20 lines, you'll need O(2^20) = 1M memory (plus size of each line,
and size of pointers overhead, etc.). Still doable.

With 30 lines, you'll need O(2^30) = 1G of lines.
If each of your lines is 80 characters, you'll need 80GB (before
counting overhead of pointers).
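In closed form, n input lines yield 2^(n+1) - 2 output lines. A small loop
(a sketch; `seq` just supplies arbitrary input lines) checks this against
sed's actual behavior:

```shell
# Compare sed's actual output line count with 2^(n+1) - 2
for n in 1 2 3 4 5 6; do
  got=$(( $(seq "$n" | sed 'G;H' | wc -l) ))
  want=$(( (1 << (n + 1)) - 2 ))
  echo "n=$n got=$got want=$want"
done
```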


>       # this is fine
>      sed '/a/d; G; H;' input

This is "fine" because the "/a/d" command deletes all lines of your
input, hence nothing is stored in the pattern/hold buffers.
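A minimal check (not from the original report) makes the difference visible:

```shell
# /a/d deletes the pattern space and starts the next cycle,
# so G and H never execute and nothing ever accumulates.
printf 'aa\naa\naa\n' | sed '/a/d; G; H' | wc -l   # 0 lines
```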

>      I learned from http://www.grymoire.com/Unix/Sed.html that 'G' appends
> hold space to pattern space, and 'H' does the inverse.
>      In the first two examples, the buffer of hold space will be appended to
> pattern space, and subsequently content of pattern space will be appended
> to hold space once more. With one more input line, the two buffers will be
> doubled; and as long as the input file is big enough, sed may finally eat
> up the memory and populate the output.

Yes, that's how it works.

>      We think this is vulnerable since it may eat up the memory in a few
> seconds.

Any program that keeps the input in memory is vulnerable
to unbounded input size. That is not a bug.

As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.

regards,
 - assaf





Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sat, 19 Jan 2019 21:28:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 34133 <at> debbugs.gnu.org and Hongxu Chen <leftcopy.chx <at> gmail.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sat, 19 Jan 2019 21:28:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sun, 20 Jan 2019 02:24:02 GMT) Full text and rfc822 format available.

Message #18 received at 34133 <at> debbugs.gnu.org (full text, mbox):

From: Hongxu Chen <leftcopy.chx <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 34133 <at> debbugs.gnu.org
Subject: Re: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sun, 20 Jan 2019 10:23:16 +0800
Hi Assaf,

    Thanks  for the explanation.

    We think the way sed works may be open to attack. If a user downloads
some sed scripts and runs them *without root privilege*, the host machine
may soon run out of memory; in my case, the machine actually hung and I had
to restart it. The problem may be more severe when the machine is hosting a
service, or itself performs some sed-based service such as text processing
(perhaps rare), even inside some sandbox. The issue may also be triggered
unintentionally, causing surprise and trouble.

> Any program that keeps the input in memory is vulnerable to unbounded
input size

    I think the input file need not be big; and it can be made even smaller
by appending more "G;H" pairs to the script.
    Maybe sed could somehow flush its buffers to avoid the memory usage?

Best Regards,
Hongxu



Information forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sun, 20 Jan 2019 03:38:02 GMT) Full text and rfc822 format available.

Message #21 received at 34133 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Hongxu Chen <leftcopy.chx <at> gmail.com>
Cc: 34133 <at> debbugs.gnu.org
Subject: Re: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sat, 19 Jan 2019 20:37:35 -0700
Hello,

On 2019-01-19 7:23 p.m., Hongxu Chen wrote:
>      We think the way sed works may suffer from attacks.

Not more than any other program that uses memory or disk
based on its input.

> If the user downloads some
> sed scripts and run *without root privilege*, the host machine may soon 
> exceed
> the memory;

First,
root privileges have nothing to do with it.
A non-privileged user can consume as much memory as they want,
and a poorly configured machine will be brought to its knees.

Well configured machines will not allow user-space programs
to cripple them (but I readily admit that such configuration
is not trivial to achieve).
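One such limit, sketched here with arbitrary values (Linux-style `ulimit -v`,
in kilobytes), makes the exponential script fail quickly instead of dragging
down the whole machine:

```shell
# Cap the subshell's address space at ~256 MB; sed's doubling buffers
# soon exceed the limit and sed exits with an allocation error.
( ulimit -v 262144; seq 30 | sed 'G;H' > /dev/null 2>&1 )
echo "sed exit status: $?"   # non-zero, instead of gigabytes of memory use
```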

Second,
any user who downloads any program from an untrusted source is
expected to know what they're doing.
If they don't - then they can cause a lot of damage.

This has nothing to do with sed.

> in my case, the machine actually hangs and I have to restart 
> it.

Yes, many common installations are not configured to handle
memory exhaustion.

> The
> problem may be severer when the machine is hosting some service or does
> the sed relevant service such as text processing (may be rare) itself 
> even inside
> some sandbox. The issue may also be triggered unconsciously thus cause 
> surprise
> and trouble.

That is all true, but has nothing to do with sed.

> 
>  > Any program that keeps the input in memory is vulnerable to unbounded 
> input size
> 
>      I think input size is not big; and the size can still be reduced as 
> long as more "G;H"s
> are appended to the script.
>   Maybe sed can do something flush to avoid memory usage?
> 

I'll rephrase my words:

Your sed program has O(2^N) space requirements.
Any program that has exponential behavior (be it in space or time)
will quickly lead to pathological cases.

So while your input file is not too big (N=30 lines),
it leads to huge memory requirements.

I recommend reading some background about complexity
(most of these deals with Time complexity, but it applies to space as well):
  http://bigocheatsheet.com/
  https://en.wikipedia.org/wiki/Time_complexity#Exponential_time
  https://en.wikipedia.org/wiki/Computational_complexity

To illustrate my point further, here are similar examples in AWK and
PERL that would choke with your input:

  awk '{ buf = buf $1 ; buf = buf buf } END { print buf }'
  perl -ne '$buf .= $_ ; $buf .= $buf ; print $buf' < input
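The awk one-liner exhibits the same 2^(n+1) - 2 growth, now in bytes of
`buf` (a quick check, with `seq` supplying the input for convenience):

```shell
# buf is appended to and then doubled per record, like sed's hold space
for n in 1 2 3 4 5; do
  seq "$n" | awk '{ buf = buf $1; buf = buf buf } END { print length(buf) }'
done
# prints 2, 6, 14, 30, 62
```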

Just as you wrote above, a non-root user who runs these on your input
will consume a huge amount of memory. This is why I said it has nothing
to do with sed.

To see the unfolding of exponential growth in action,
try the following C program (which goes to show it is not just
awk/perl/sed scripts):

    /* Compile with:
          gcc -o 1 1.c -lm

       Run with:
         seq 30 | ./1
    */
    #define _GNU_SOURCE
    #define _POSIX_C_SOURCE 200809L
    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
      char *line = NULL;   /* getline() allocates; must start as NULL */
      size_t linenum = 0;
      size_t buf_size = 0;
      size_t n = 0;
      ssize_t i;

      while ( (i = getline(&line, &n, stdin)) != -1) {
         ++linenum;
         buf_size += n;
         buf_size *= 2;
         printf ("line %zu (%zu bytes), 2^%zu == %g, buf_size = %zu bytes\n",
                 linenum, n, linenum, pow(2, linenum), buf_size);
      }
      free(line);
      return 0;
    }


The above does not actually allocate memory, it just shows how much
would be allocated. Run it with your input and see for yourself:

  line 1 (120 bytes), 2^1 == 2, buf_size = 240 bytes
  line 2 (120 bytes), 2^2 == 4, buf_size = 720 bytes
  line 3 (120 bytes), 2^3 == 8, buf_size = 1680 bytes
  [...]
  line 30 (120 bytes), 2^30 == 1.07374e+09, buf_size = 257698037520 bytes

If you tried to do "malloc(buf_size)" it would consume all your memory
(if you have less than 256GB).

----

This applies not just to memory (RAM), but to disk space as well
(which conceptually is just different type of storage).

Try this example:

    printf a > in
    for i in $(seq 20); do cat in in > out ; mv out in ; done ;

The input file starts with 1 byte.
After 20 rounds, the file size will be 1MB.
If you try 30 rounds, the file will be 1GB.
The "20" and "30" here correspond to your input size (number of lines).
Small change in input leads to large changes in output (thus
"exponential" programs are considered bad).

If you are still not convinced, try 40 rounds - that will attempt
to create a 1TB file. If your disk is smaller - it will become full
(which is just like memory exhaustion).
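The doubling is cheap to verify at small scale (same `in`/`out` filenames
as above; run it in a scratch directory):

```shell
# Ten doublings of a 1-byte file: 2^10 = 1024 bytes
printf a > in
for i in $(seq 10); do cat in in > out ; mv out in ; done
wc -c < in   # 1024
```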

----

I hope this resolves the issue.

regards,
 - assaf


Information forwarded to bug-sed <at> gnu.org:
bug#34133; Package sed. (Sun, 20 Jan 2019 06:39:02 GMT) Full text and rfc822 format available.

Message #24 received at 34133 <at> debbugs.gnu.org (full text, mbox):

From: Hongxu Chen <leftcopy.chx <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 34133 <at> debbugs.gnu.org
Subject: Re: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sun, 20 Jan 2019 14:38:18 +0800
Hi Assaf,

    While I still think this is a defect in sed, I agree that programmers
should ensure a script will not turn into a bomb. What I'm wondering is
whether there could be some built-in check to avoid this (e.g., roughly,
warning when a "G" is directly followed by an "H").

In addition, the error messages could be more meaningful in certain
scenarios. I believe different optimizations are involved in the following
scripts, but the stderr output could be improved.

sed '/b/e; G; H;' input >/dev/null  # instant exit, fine
sed '/a/e; G; H;' input >/dev/null  # many lines of `sh: 1: Syntax error: ";" unexpected`
sed '/a/x; G; H;' input >/dev/null  # instant exit, fine
sed '/b/x; G; H;' input >/dev/null  # long wait

Generally, the *semantics* of sed scripts ought to be made much clearer. To
be honest, I was quite confused about what these scripts mean until I had
thought about them for a while.


Best Regards,
Hongxu



bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 17 Feb 2019 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 68 days ago.


