GNU bug report logs - #46048
split -n K/N loses data, sum of output files is smaller than input file.


Package: coreutils

Reported by: Paul Hirst <contact <at> phirst.org>

Date: Sat, 23 Jan 2021 08:26:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 46048 in the body.
You can then email your comments to 46048 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#46048; Package coreutils. (Sat, 23 Jan 2021 08:26:02 GMT)

Acknowledgement sent to Paul Hirst <contact <at> phirst.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 23 Jan 2021 08:26:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Paul Hirst <contact <at> phirst.org>
To: bug-coreutils <at> gnu.org
Subject: split -n K/N loses data,
 sum of output files is smaller than input file.
Date: Fri, 22 Jan 2021 18:58:03 -1000
split --number K/N appears to lose data, with the sum of the sizes of
the output files being smaller than the original input file by 131072 bytes.

$ split --version
split (GNU coreutils) 8.30
...

$ head -c 1000000 < /dev/urandom > test.dat
$ split --number=1/4 test.dat > t1
$ split --number=2/4 test.dat > t2
$ split --number=3/4 test.dat > t3
$ split --number=4/4 test.dat > t4

$ ls -l
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
-rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
-rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat

Surely this should not be the case?

Paul
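
The size check in the report can be automated. The sketch below is a hypothetical helper, not part of coreutils: `check_pieces` is an invented name, and it assumes the pieces are passed in order. It prints how many bytes are missing when the piece sizes are summed, and succeeds only if the concatenated pieces are byte-for-byte identical to the original, since matching sizes alone do not prove matching content.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils).
# Usage: check_pieces ORIGINAL PIECE...
# Prints ORIGINAL's size minus the summed sizes of the pieces, then
# returns success only if the concatenated pieces equal ORIGINAL.
check_pieces() {
    original=$1
    shift
    total=0
    for piece in "$@"; do
        # wc -c may pad its output with spaces; arithmetic expansion copes.
        total=$(( total + $(wc -c < "$piece") ))
    done
    echo $(( $(wc -c < "$original") - total ))
    # The function's exit status is that of cmp.
    cat "$@" | cmp -s - "$original"
}
```

For the listing above, `check_pieces test.dat t1 t2 t3 t4` would print 131072 (1000000 minus 3 × 250000 + 118928) and return a non-zero status.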

Information forwarded to bug-coreutils <at> gnu.org:
bug#46048; Package coreutils. (Sun, 24 Jan 2021 16:54:01 GMT)

Message #8 received at 46048 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Hirst <contact <at> phirst.org>, 46048 <at> debbugs.gnu.org
Subject: Re: bug#46048: split -n K/N loses data, sum of output files is
 smaller than input file.
Date: Sun, 24 Jan 2021 16:52:57 +0000
On 23/01/2021 04:58, Paul Hirst wrote:
> split --number K/N appears to lose data, with the sum of the sizes of
> the output files being smaller than the original input file by 131072 bytes.
> 
> $ split --version
> split (GNU coreutils) 8.30
> ...
> 
> $ head -c 1000000 < /dev/urandom > test.dat
> $ split --number=1/4 test.dat > t1
> $ split --number=2/4 test.dat > t2
> $ split --number=3/4 test.dat > t3
> $ split --number=4/4 test.dat > t4
> 
> $ ls -l
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
> -rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
> -rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat
> 
> Surely this should not be the case?

Ugh. This functionality was broken for all files > 128KiB
due to adjustments for handling /dev/zero.

$ truncate -s 1000000 test.dat
$ split --number=4/4 test.dat | wc -c
118928

The following patch fixes it here.
I need to do some more testing, before committing.

thanks!

diff --git a/src/split.c b/src/split.c
index 0660da13f..6aa8d50e9 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
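
The one-word change matters because SEEK_CUR seeks relative to the current file offset, while SEEK_SET seeks from the beginning of the file. The arithmetic below is a sketch of why the last chunk came out at 118928 bytes. It assumes, consistently with the figures in this thread but not copied from split.c, that split had already read a 128 KiB initial buffer from stdin and that chunk K starts at (K-1) * (file_size / N).

```shell
#!/bin/sh
# All values are byte offsets. initial_read and the start formula are
# assumptions inferred from the numbers in this thread, not from split.c.
file_size=1000000
n=4
k=4
initial_read=131072    # 128 KiB assumed to be already consumed from stdin

# Intended absolute offset of chunk k: 750000 for 4/4.
start=$(( (k - 1) * (file_size / n) ))

# SEEK_CUR adds start to the current position, which is already initial_read.
seek_cur_offset=$(( initial_read + start ))
# SEEK_SET treats start as an absolute position.
seek_set_offset=$start

echo "SEEK_CUR leaves $(( file_size - seek_cur_offset )) bytes to read"
echo "SEEK_SET leaves $(( file_size - seek_set_offset )) bytes to read"
```

Under these assumptions the script prints 118928 for SEEK_CUR and 250000 for SEEK_SET, matching the truncated t4 in the report and the expected chunk size respectively.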




Information forwarded to bug-coreutils <at> gnu.org:
bug#46048; Package coreutils. (Sun, 24 Jan 2021 16:59:01 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#46048: split -n K/N loses data, sum of output files is
 smaller than input file.
Date: Sun, 24 Jan 2021 16:58:43 +0000
On 24/01/2021 16:52, Pádraig Brady wrote:
> diff --git a/src/split.c b/src/split.c
> index 0660da13f..6aa8d50e9 100644
> --- a/src/split.c
> +++ b/src/split.c
> @@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
>        }
>      else
>        {
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>            die (EXIT_FAILURE, errno, "%s", quotef (infile));
>          initial_read = SIZE_MAX;
>        }

The same adjustment is needed in lines_chunk_split().
I'll add a test also.

cheers,
Pádraig





Information forwarded to bug-coreutils <at> gnu.org:
bug#46048; Package coreutils. (Sun, 24 Jan 2021 19:56:01 GMT)

Message #14 received at 46048 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: Paul Hirst <contact <at> phirst.org>, 46048 <at> debbugs.gnu.org
Subject: Re: bug#46048: split -n K/N loses data, sum of output files is
 smaller than input file.
Date: Sun, 24 Jan 2021 11:55:27 -0800
On 1/24/21 8:52 AM, Pádraig Brady wrote:
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)

Dumb question: will this handle the case where you're splitting from 
stdin and stdin is a seekable file and its initial file offset is nonzero?




Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Mon, 25 Jan 2021 14:22:02 GMT)

Notification sent to Paul Hirst <contact <at> phirst.org>:
bug acknowledged by developer. (Mon, 25 Jan 2021 14:22:02 GMT)

Message #19 received at 46048-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Paul Hirst <contact <at> phirst.org>, 46048-done <at> debbugs.gnu.org
Subject: Re: bug#46048: split -n K/N loses data, sum of output files is
 smaller than input file.
Date: Mon, 25 Jan 2021 14:21:35 +0000
On 24/01/2021 19:55, Paul Eggert wrote:
> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
> 
> Dumb question: will this handle the case where you're splitting from
> stdin and stdin is a seekable file and its initial file offset is nonzero?

Right. Following on the logic from input_file_size(),
I'm going with the attached, which I'll push later.
Marking this as done.

thanks,
Pádraig
[split-k_of_n.patch (text/x-patch, attachment)]
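
The attached patch is not reproduced in this log, so the following is not the committed fix. It is only a sketch of the concern Paul Eggert raises: a process can inherit a seekable stdin whose offset is already nonzero, in which case an absolute seek would ignore the bytes that were deliberately skipped. The scratch file, its size, and the variable names (`initial_offset`, `already_read`, `start`) are all illustrative assumptions.

```shell
#!/bin/sh
# Create a 1000-byte scratch file (size chosen only for illustration).
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1000 count=1 2>/dev/null

# The first dd consumes 100 bytes from the shared stdin. The commands after
# it inherit the same open file description, so they start at offset 100.
remaining=$( { dd bs=100 count=1 of=/dev/null 2>/dev/null; cat | wc -c; } < "$tmp" | tr -d ' ' )
echo "bytes visible to the second reader: $remaining"
rm -f "$tmp"

# Offset arithmetic for a reader that starts at initial_offset, has already
# buffered already_read bytes, and wants the byte that lies start bytes
# into its own input.
initial_offset=100
already_read=50
start=200

# Seeking from the current position by (start - already_read) respects
# the inherited offset.
relative_target=$(( initial_offset + already_read + (start - already_read) ))
# Seeking to start with SEEK_SET counts from byte 0 of the file instead.
absolute_target=$start

echo "relative seek lands at $relative_target, absolute seek at $absolute_target"
```

In this sketch the relative seek lands at initial_offset + start, whereas the absolute seek lands 100 bytes short of it.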

Information forwarded to bug-coreutils <at> gnu.org:
bug#46048; Package coreutils. (Mon, 08 Feb 2021 13:55:01 GMT)

Message #22 received at 46048 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: 46048 <at> debbugs.gnu.org
Cc: Paul Hirst <contact <at> phirst.org>
Subject: Re: bug#46048: split -n K/N loses data, sum of output files is
 smaller than input file.
Date: Mon, 8 Feb 2021 13:54:27 +0000
On 25/01/2021 14:21, Pádraig Brady wrote:
> On 24/01/2021 19:55, Paul Eggert wrote:
>> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>>
>> Dumb question: will this handle the case where you're splitting from
>> stdin and stdin is a seekable file and its initial file offset is nonzero?
> 
> Right. Following on the logic from input_file_size(),
> I'm going with the attached, which I'll push later.
> Marking this as done.

Note this fix has now propagated to Fedora builds,
and is in the process of propagating to RHEL/CentOS.

I've just logged a Debian bug also:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=982300

cheers,
Pádraig




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 09 Mar 2021 12:24:06 GMT)
